[09:25:14] Errand + lunch
[10:38:40] lunch
[13:29:57] https://gerrit.wikimedia.org/r/c/operations/puppet/+/952864 quick patch to turn off monitoring for wdqs1005, we're about to decom it
[13:57:54] ^^ setting to WIP for now. Anyone know why we redirect bigdata/ldf to wdqs1005 specifically? Re: https://github.com/wikimedia/operations-puppet/blob/production/hieradata/common/profile/trafficserver/backend.yaml#L198
[13:58:13] and do we need to change that to an active host?
[14:55:39] \o
[14:56:55] another quick PR to bump the flink-app chart version if anyone has time to +1: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/952891
[17:02:23] ^^ pfischer if you're still around, that one is blocking the flink-zk deployment in dse-k8s
[17:23:34] hmm, I'm thinking cancelling the reindex might be the right approach. I can see indexing latency has been building as more indexes move to the new mappings: https://grafana-rw.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?from=now-3M&orgId=1&to=now&viewPanel=23
[17:23:48] it had been consistently ~10ms for a while; it's now up to 20ms and climbing
[17:25:15] although at this point the biggest indexes like commonswiki, wikidata, dewiki and enwiki have all finished, so maybe most of the pain is already behind us.
[17:26:08] Lunch/errands, back in ~1h
[17:46:09] ebernhardson: I'm okay with cancelling the reindexing. Sorry I semi-broke stuff, and left it to you while I was on vacation to boot! Do you want to let it run for another 48 hours (assuming it doesn't get much worse) and talk more on Wednesday? Or you and I can schedule something to talk sooner.
[18:37:01] Trey314159: it can probably continue running; other than the slower reindexing and increased indexing latency I'm not seeing anything too bad (jobs aren't backlogging or taking twice as long)
[18:38:45] ebernhardson: ok.. we can chat Wednesday, then
[18:43:55] back
[19:32:47] numbers are not great on the new mappings :( I pulled the prod mappings for mediawiki.org (old) and en.wikipedia.org (new). Loaded 10k docs from the mediawiki.org dump into each a few times. The old mappings typically finish in 55s-60s. The new mappings take ~175s
[19:33:00] Just on my laptop, but 3x is enough that it's probably significant
[19:34:29] helmfile doesn't want to deploy the new test flink-app in dse, getting out the hammer...
[19:45:26] ebernhardson: yeah. not great. crap.
[19:46:44] should I start work on reverting and un-re-indexing?
[19:47:16] Trey314159: hmm, at triple the indexing rate we probably do have to at least revisit it. I don't know if it rises to needing immediate action.
[19:47:36] s/rate/latency/
[19:49:14] So, bad enough to fix, but not bad enough to earn an "I broke Wikipedia" t-shirt. (so close!)
[19:52:41] ebernhardson: I set up a meeting for tomorrow, first thing for you, if that's okay. Move it to later if it's too early.
[19:52:49] sorry about all this
[19:53:42] Sounds good, I suppose I should review and understand what exactly is changing in these mappings. Maybe something can be done :)
[19:58:28] There are definitely other ways to do things. Either improvements to regexes or doing stuff in custom filters (I just didn't want to create/update more plugins). You don't need to worry about it too much for tomorrow (though look if you are curious!). If you can walk me through your testing scenario, I can do some a/b testing and identify the problem areas... hopefully they aren't *all* problem areas!
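
(Sketch, not from the log: the 10k-doc timing test described at 19:32 was roughly of this shape, assuming a local Elasticsearch, a settings+mappings JSON pulled from prod for each configuration, and a JSON-lines dump with one document per line. The index names, file names, and chunk size are placeholders, not the actual harness used.)

import json
import time

import requests

ES = "http://localhost:9200"  # assumption: a local test cluster


def bulk_chunks(dump_file, n_docs, chunk_size=500):
    """Yield ndjson _bulk bodies of up to chunk_size docs from a JSON-lines dump."""
    lines = []
    with open(dump_file) as f:
        for i, line in enumerate(f):
            if i >= n_docs:
                break
            lines.append(json.dumps({"index": {}}))  # let ES auto-assign ids
            lines.append(line.strip())
            if len(lines) >= 2 * chunk_size:
                yield "\n".join(lines) + "\n"
                lines = []
    if lines:
        yield "\n".join(lines) + "\n"


def bench(index_name, mapping_file, dump_file, n_docs=10_000):
    # (Re)create the index from the saved prod settings + mappings.
    requests.delete(f"{ES}/{index_name}")  # ignore 404 on the first run
    with open(mapping_file) as f:
        requests.put(f"{ES}/{index_name}", json=json.load(f)).raise_for_status()

    start = time.time()
    for body in bulk_chunks(dump_file, n_docs):
        resp = requests.post(
            f"{ES}/{index_name}/_bulk",
            data=body.encode("utf-8"),
            headers={"Content-Type": "application/x-ndjson"},
        )
        resp.raise_for_status()
    requests.post(f"{ES}/{index_name}/_refresh")
    return time.time() - start


# Hypothetical file names: settings+mappings JSON saved from prod, plus a
# JSON-lines dump of document sources.
for name, mapping in [("bench-old-mappings", "mediawikiwiki_mappings.json"),
                      ("bench-new-mappings", "enwiki_mappings.json")]:
    print(name, round(bench(name, mapping, "mediawikiwiki_docs.jsonl"), 1), "s")

Running each configuration a few times and comparing the totals should be enough to show whether the old-vs-new gap described above reproduces.
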
[20:04:36] Got a new deployment-chart patch up if anyone has time to look. Basically just fiddling around with flink config until it works: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/952927
[20:14:31] i dunno, seems plausible :P
[20:22:51] ebernhardson: thx! If/when you're ready to merge https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/951960, LMK
[20:26:37] forbidden flink config key again ;(
[20:32:28] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/952929 OK, one more "fiddlin' around" MR
[21:02:03] OK, we have a working(?) deploy of rdf-streaming-updater with ZK
[21:13:42] \o/
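
(Sketch, not from the log: one way to sanity-check a "working(?)" deploy like the rdf-streaming-updater one above is to ask the Flink JobManager REST API directly. This assumes the JobManager REST port has been made reachable locally, e.g. via kubectl port-forward; the URL is a placeholder and the endpoints are generic Flink ones, not the actual deployment tooling used here.)

import requests

FLINK_REST = "http://localhost:8081"  # assumption: port-forwarded JobManager REST port

# Are the expected jobs actually RUNNING?
overview = requests.get(f"{FLINK_REST}/jobs/overview", timeout=10)
overview.raise_for_status()
for job in overview.json().get("jobs", []):
    print(f"{job['name']}: {job['state']}")

# Did the ZooKeeper HA settings make it into the effective cluster config?
# /jobmanager/config returns the configuration as a list of key/value entries.
config = requests.get(f"{FLINK_REST}/jobmanager/config", timeout=10).json()
ha = {e["key"]: e["value"] for e in config if e["key"].startswith("high-availability")}
for key, value in sorted(ha.items()):
    print(f"{key} = {value}")
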