[08:09:00] ryankemper: cool! Let's wait on that one and fingers crossed, we might increase the throttling on moving shards
[08:09:11] well, might be "decrease" in this context
[08:59:25] meal 2 + errand
[10:03:29] lunch
[11:10:55] time for some cooking
[11:23:38] lunch
[13:50:08] can't make any sense of the change prop metrics...
[13:50:38] esp:
[13:50:41] https://grafana-rw.wikimedia.org/explore?orgId=1&left=%5B%221624924800000%22,%221625615999000%22,%22codfw%20prometheus%2Fk8s%22,%7B%22expr%22:%22irate(sum(cpjobqueue_normal_rule_processing_count%7Brule%3D%5C%22cirrusSearchElasticaWrite-cpjobqueue-partitioned-mediawiki-job-cirrusSearchElasticaWrite%5C%22%7D%5B5m%5D))%22,%22requestId%22:%22Q-d852b6e9-5055-4e58-a7e2-fde6d0f87493-0A%22%7D%5D
[13:51:10] "parse error at char 153: expected type instant vector in aggregation expression, got range vector" ?
[13:51:43] ah sorry
[13:52:04] https://grafana-rw.wikimedia.org/explore?orgId=1&left=%5B%221624924800000%22,%221625615999000%22,%22codfw%20prometheus%2Fk8s%22,%7B%22expr%22:%22irate(cpjobqueue_normal_rule_processing_count%7Brule%3D%5C%22cirrusSearchElasticaWrite-cpjobqueue-partitioned-mediawiki-job-cirrusSearchElasticaWrite%5C%22%7D%5B5m%5D)%22,%22requestId%22:%22Q-65e8a73d-96da-477c-ad41-b675de345c2d-0A%22%7D%5D
[13:52:31] this is for a topic named cirrusSearchElasticaWrite
[13:52:46] which has 3 partitions
[13:53:03] not sure what it means, but it sure is weird
[13:53:14] something happens on Jul 2 at 4am
[13:53:24] redistributing the consumers
[13:54:04] the lines are for partitions?
[13:54:27] I suppose so, they're by pods
[13:54:42] it looks like two pods went down, two came up
[13:54:44] but I guess they're each assigned a partition
[13:55:50] what are the values here?
[13:57:07] rate of processed jobs I think
[13:57:42] that would mean that for some reason there are a lot more after the pods were replaced
[13:58:00] the topic was already backlogged
[13:58:07] but that can probably be explained
[13:58:23] right, the new pods managed to pick up more than the old ones
[13:58:37] so what's the problem?
[13:58:50] the problem I have is ruling out change prop itself being the bottleneck
[13:59:28] ah, you're talking about the recent wikidata congestion issue
[14:42:30] hmm, I seem to have two office hours at the same time now in my calendar
[15:07:12] \o
[15:07:28] o/
[15:10:45] o/
[15:11:23] giving up on the indexing latencies for now, they're clearly claiming to be doing more jobs but I cannot find evidence that there are more messages produced nor more indexing requests sent to elastic :(
[15:11:42] *giving up for now
[15:12:33] dcausse: yea I saw the same, the kafka topic metrics don't seem to line up with what I see on the job graphs
[15:13:05] will ponder, but I don't have great ideas :S In theory we can bump parallelism but I hate solutions where I don't know what's wrong and don't fix anything...
[15:14:12] we should perhaps tag Hugh or Petr, they might have ideas
[15:14:58] hmm, yea
[16:40:27] dinner
[16:42:47] * ebernhardson realizes he doesn't even have the password for cindy's gerrit account in the password manager :)
[16:43:23] really I should make cindy properly vote -1 when the whole suite fails spectacularly... but that's going to take some time
[17:09:31] huh, actually something odd happened inside cindy (git version upgrade?) and it didn't want to check out patches until I set user.email and user.name global values inside the mwv instance
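
Editor's note on the PromQL parse error at 13:51:10: aggregation operators like sum() only accept instant vectors, so handing sum() the [5m] range selector (as in the first Grafana query) is rejected; irate()/rate() has to be applied to the range vector first, with any aggregation wrapped around the result. A minimal sketch of the three forms, using the same rule label as the queries above (the sum-of-irate variant is the usual pattern for a single combined rate, not something shown in the log):

    # broken: sum() receives a range vector and errors out
    irate(sum(cpjobqueue_normal_rule_processing_count{rule="cirrusSearchElasticaWrite-cpjobqueue-partitioned-mediawiki-job-cirrusSearchElasticaWrite"}[5m]))

    # per-pod/partition rate, as used in the second query
    irate(cpjobqueue_normal_rule_processing_count{rule="cirrusSearchElasticaWrite-cpjobqueue-partitioned-mediawiki-job-cirrusSearchElasticaWrite"}[5m])

    # single combined rate across pods: aggregate the per-series rates
    sum(irate(cpjobqueue_normal_rule_processing_count{rule="cirrusSearchElasticaWrite-cpjobqueue-partitioned-mediawiki-job-cirrusSearchElasticaWrite"}[5m]))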
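
Editor's note on the 17:09:31 message: git refuses to create commits when it cannot determine a committer identity, and pulling a gerrit change down as a local commit (e.g. via cherry-pick) hits exactly that. A sketch of the workaround described, with placeholder identity values rather than the real cindy account details:

    # placeholder name/email; substitute whatever the cindy account actually uses
    git config --global user.name "cindy-the-browser-test-bot"
    git config --global user.email "cindy@example.invalid"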