[07:01:50] ebernhardson: you can commit via IntelliJ, it has an option to auto organize imports on commit
[09:09:58] created T300833
[09:09:58] T300833: Address "Log4shell" vulnerability in Elasticsearch 7.10.2 - https://phabricator.wikimedia.org/T300833
[09:10:13] I love the Log4Shell name for the vulnerability :)
[09:20:17] dcausse: anything I can help with when it comes to reconciliation?
[09:22:41] good morning
[09:22:52] o/
[09:23:01] also, it's really subjective :)
[09:43:30] o/
[09:44:01] zpapierski: I have this patch for review https://gerrit.wikimedia.org/r/c/wikimedia/discovery/analytics/+/753791 but I'm not super happy with it yet
[09:44:40] the rest will be mostly test & deploy
[09:45:31] ok, looking
[09:54:05] dcausse: what happens between when troublesome events get dumped to kafka and when this DAG is executed?
[09:54:15] is there any additional processing on HDFS?
[09:54:29] (some dedup, perhaps?)
[09:55:24] zpapierski: "troublesome events get dumped" is what stream?
[09:55:34] all three
[09:55:54] (though I don't remember what "lapsed action" means)
[09:56:12] so they're imported into hdfs/hive and they're de-dupped based on the event id
[09:56:32] Hi David
[09:56:38] ejoseph: hey!
[09:57:15] When can we continue?
[09:57:50] ejoseph: in 10 or 15 mins, would that work for you?
[09:58:20] zpapierski: something that might need some investigation is to understand why we get a lot more divergences on commons vs wikidata
[09:58:51] looking at https://grafana-rw.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater?orgId=1 we're not very good with wcqs :/
[09:58:57] dcausse: sure
[10:03:27] we get a lot more failed events because of 404s on commons, so perhaps that explains it, but still..
[10:06:01] zpapierski: "lapsed action" is "late events", Andrew did not like to use "event" in the name of the stream :)
[10:08:21] those 404s relate to the missing mediainfo slot? didn't we take care of that?
[10:09:20] these 404s seem to relate to eventual consistency, I've pushed a patch to increase the retry delay for recent events
[10:09:38] it defaults to 10sec but we can try 15sec to see if it has an impact
[10:09:57] weird, I remember that those 10s were somewhat liberal anyway...
[10:10:24] yes...
[10:12:45] ejoseph: I'm in meet.google.com/eek-ziau-djx if you want to continue
[10:16:19] gehel: can you submit this? https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/753426
[10:16:50] dcausse: sorry, i might need 20 more minutes
[10:16:58] kk
[10:20:08] zpapierski: looking
[10:20:24] zpapierski: done
[10:20:30] thx!
[10:52:26] dcausse: i'm back
[10:53:18] ejoseph: sorry, can we postpone to 2pm? I have to go out in a couple of minutes
[10:53:45] Ok
[10:53:58] cool, I'll ping you by then
[10:54:46] lunch
[10:55:05] ejoseph: we can continue with concurrency in Java instead if you wish
[10:56:06] http://meet.google.com/eek-ziau-djx
[11:15:36] Lunch
[11:41:37] lunch
[13:03:02] ejoseph: I'm around, ping me if you want to continue
[13:03:59] oops, forget about that^, just saw your email
[14:02:23] dcausse, zpapierski: meeting: https://meet.google.com/dyi-sopm-ihj
[14:02:27] oops
[14:05:51] Greetings!
[14:22:59] dunno if anyone got my last message, but I'm in. Having dock issues at the moment
[14:24:49] rebooting to hopefully fix a network config problem
[14:25:33] o/
[14:38:29] OK, back
[15:58:26] \p
[15:58:28] \o even
[16:12:49] o/
[16:17:05] * ebernhardson is still trying to find something that says this isn't the actual delivered concurrency to ElasticaWrite: https://grafana.wikimedia.org/goto/0v1yWzanz
[16:17:22] because if so ... i guess we need more partitions or something.
[16:17:39] but the partitioner doesn't support it :(
[16:19:12] the 300 concurrency is per dc partition?
[16:20:45] dcausse: 100 per partition, so each of the three lines in that graph should hit 100 if we are capped out
[16:21:32] that means the consumer there is able to pull and wait for 100 messages to be processed
[16:22:25] yes, i'm wondering if the problem is that nodejs is single threaded (i think, not 100%) and the pod has to serve lots of things? But i would have expected that to be noticeable for more than just us
[16:23:07] if we wanted 100 concurrent requests at 0.3s per req, that would be asking nodejs to spit out a job every 1ms from 1 core. too much, likely
[16:23:13] if there's a single consumer for multiple topics it might also be the kafka driver?
[16:23:47] no clue how changeprop deals with kafka, sadly
[16:24:33] hmm, could be that as well. Depends on how it's round-robining all the topics it's listening to
[16:30:49] i guess i need to make a ticket and hope the core team can look into it... could figure it out, but i suspect it would be a significant time sink for us to delve into that
[16:31:20] yes, agreed
[16:31:39] the numbers make no sense to me
[16:31:54] esp. throughput on partition 1
[16:31:59] https://grafana-rw.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&from=now-7d&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=main-eqiad&var-topic=eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite&var-consumer_group=All
[16:32:28] oh, it's not catching up in codfw :S
[16:33:08] no :/
[16:33:39] the max throughput achieved when the sanitizer was enabled is not achieved now
[16:33:49] it's feeling like overload on the parts distributing jobs (and echoes my comments from last time that increasing concurrency was an arms race, where we are just stealing other jobs' runtime with higher concurrency numbers)
[16:34:13] yes
[16:34:45] it should expose its concurrency numbers through prometheus, that'd make it easier to understand where the bottleneck is
[16:35:29] yes, i asked last time around but they wanted to go with (50th percentile latency * jobs retired/s) instead of tracking concurrency directly. Not clear why, i like direct numbers :)
[16:35:59] i feel like that equation is missing the downtime between requests though, the requests don't fire at the exact moment the last one ended
[16:36:05] oh, inferring concurrency from throughput?
[16:36:08] yes
[16:36:14] hm...
[16:37:01] it's likely that, or pure nodejs limitations
[17:39:30] gehel and/or ryankemper, do you have time to join https://meet.google.com/iqe-wcuz-mpn ? dcausse and I are looking at rebuilding a deb pkg, https://gerrit.wikimedia.org/r/c/operations/software/elasticsearch/plugins/+/755750
[17:42:39] inflatador: yup, one sec
[17:45:57] * ebernhardson appreciates code reviews on this banned usernames patch, i feel significantly less bad looking at it now after a few rounds :)
[18:15:18] Sorry, dinner time here
[18:23:58] dcausse and ejoseph, the ES6.8 pkgs are built/uploaded, let us know if they do not work
[18:32:18] ebernhardson: brian and I took a look at the patch and it looks good, but we'll defer to gehel on whether the java half of the changes is good to go
[18:36:53] sounds good
[18:50:57] lunch, back in ~1 h
[19:02:20] ryankemper: which CR are you referring to?
[19:05:08] gehel: likely https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/757993 because i put a cleanup patch on top of it
[19:05:28] (they touch the same lines of runBlazegraph.sh, would be edit conflicts otherwise)
[19:25:42] huh, i hadn't thought about it until seeing the question on office wiki, but there was no all-hands this year.
[19:26:00] (but there's still time :)
[19:26:47] gehel: finishing up lunch, will be 5’ late to pairing
[19:29:15] will be 2' late, Oscar's injections
[19:35:00] ryankemper, inflatador: meeting? https://meet.google.com/ckm-dmmh-opt
[19:35:12] Or if you're busy and don't need me, we can cancel
[19:48:28] gehel, sorry I missed you, we can get together tomorrow if need be. I think ryankemper will work on bringing some of the IAD elastic hosts online later
[19:48:47] inflatador: we're still in there if you want to join
[19:49:57] ah cool, BRT then!
[19:50:06] also https://i.redd.it/2qm6k17ammf81.jpg
[20:44:56] ebernhardson: have a moment to help brian and me troubleshoot `elasticsearch-oss` failures? we thought we'd fixed the dependency issue but it appears not quite :x
[20:45:17] we're in https://meet.google.com/ckm-dmmh-opt rn
[20:51:32] ryankemper: sure
[20:52:00] https://phabricator.wikimedia.org/P20138
[20:54:06] dpkg-query -L ${pkgname}
[20:55:41] https://phabricator.wikimedia.org/P20141 ryankemper ebernhardson
[20:56:13] https://phabricator.wikimedia.org/T276198#6874170
[21:21:48] not to distract, but I created https://phabricator.wikimedia.org/T300928 to capture some of our pain points and hopefully work through them
[21:22:40] ebernhardson: inflatador: https://phabricator.wikimedia.org/P20146
[21:50:13] ebernhardson: inflatador: https://gerrit.wikimedia.org/r/c/operations/puppet/+/759617
[22:00:18] ebernhardson: inflatador: https://phabricator.wikimedia.org/P20151
[22:53:46] later all, thanks for the help today!
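For the note at 09:56:12 that events are imported into hdfs/hive and de-dupped based on the event id, a rough PySpark sketch of that kind of de-duplication is below. The table name, field names (meta.id, meta.dt) and output table are assumptions for illustration, not the actual discovery/analytics schema.

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("dedup-by-event-id").getOrCreate()

# Hypothetical Hive table of imported events; not the real schema.
events = spark.table("event.some_event_stream")

# Keep a single row per event id, preferring the most recently received copy.
w = Window.partitionBy(F.col("meta.id")).orderBy(F.col("meta.dt").desc())

deduped = (
    events
    .withColumn("_rank", F.row_number().over(w))
    .where(F.col("_rank") == 1)
    .drop("_rank")
)

deduped.write.mode("overwrite").saveAsTable("discovery.some_event_stream_deduped")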
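The exchange at 10:08-10:09 describes 404s caused by eventual consistency and a patch that increases the retry delay for recent events (10 seconds by default, with 15 seconds proposed). A generic sketch of that retry-after-delay pattern is below; it is not the actual updater code, and the exception type and names are placeholders.

import time

def fetch_with_retry(fetch, retries: int = 3, retry_delay_sec: float = 10.0):
    """Call fetch(); if the entity is not visible yet (e.g. an HTTP 404 for a
    just-edited page), wait retry_delay_sec and try again before giving up."""
    for attempt in range(retries):
        try:
            return fetch()
        except LookupError:  # placeholder for a 404/not-found error from the API
            if attempt == retries - 1:
                raise
            time.sleep(retry_delay_sec)

# Bumping retry_delay_sec from 10.0 to 15.0 is the change being tried in the log.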
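A minimal sketch of the concurrency estimate discussed at 16:35-16:36, where changeprop is said to infer concurrency as (50th percentile latency * jobs retired/s) instead of tracking it directly. This is Little's law (in-flight jobs ≈ throughput × latency) and, as pointed out at 16:35:59, it ignores any dispatch gap between one request ending and the next starting. The function names and numbers below are illustrative only and are not changeprop code.

def inferred_concurrency(jobs_per_sec: float, p50_latency_sec: float) -> float:
    """Concurrency estimated from throughput: p50 latency * jobs retired/s."""
    return jobs_per_sec * p50_latency_sec

def required_dispatch_rate(target_concurrency: int, latency_sec: float) -> float:
    """Dispatches per second a single-threaded scheduler must sustain to keep
    target_concurrency requests in flight at the given per-request latency."""
    return target_concurrency / latency_sec

# 3 partitions * 100 concurrency at ~0.3 s per request -> ~1000 dispatches/s,
# i.e. roughly one job per millisecond from one core, matching the 16:23 estimate.
print(required_dispatch_rate(300, 0.3))   # 1000.0
print(inferred_concurrency(1000.0, 0.3))  # 300.0

If the real dispatch gap is non-zero, the inferred figure overstates the concurrency actually delivered to ElasticaWrite, which would be consistent with the graph never reaching 100 per partition.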