[13:07:51] <ebernhardson>	 \o
[13:14:59] <inflatador>	 <o/
[13:15:07] <inflatador>	 You're up early ;)
[13:15:39] <ebernhardson>	 this is pretty typical lately,  its about 6am, sun is up
[13:20:43] <inflatador>	 yeah, between that and my neighbor's dog, I rarely get to sleep past 6
[13:21:34] <inflatador>	 that stuff used to bother me when I was younger, not so much anymore
[13:51:58] <ebernhardson>	 pfischer: how did you create the page_weighted_tags_change topics? I know there is the kafka-topics tool somewhere but don't remember it being installed anywhere we had access to
[14:03:35] <pfischer>	 ebernhardson: I just ran the kafka client locally against the proxied version of kafka-main and called its createTopic method
[14:04:18] <pfischer>	 Since it complained that the topics already exist, the second time I ran it, I assumed that it worked.
[14:04:34] <ebernhardson>	 pfischer: yea it did, it just only did eqiad. I need to create the codfw topic
[14:04:49] <ebernhardson>	 looks like thats fine, i can pull https://apt.wikimedia.org/wikimedia/pool/thirdparty/confluent/c/confluent-kafka/confluent-kafka-2.11_1.1.0-1_all.deb somewhere and run it
[14:05:17] <ebernhardson>	 pfischer: i suppose to be clear, the eqiad.* and codfw.* topics are created, but only in the eqiad kafka hosts, the codfw kafka hosts don't have them
[14:05:46] <pfischer>	 Yes, I got that, sorry, I forgot about it.
[14:06:09] <ebernhardson>	 no worries, i forget things all the time :)
[14:06:10] <pfischer>	 I might as well create it the same way locally by tunnling to a codfw host
[14:06:15] <ebernhardson>	 sure
[14:07:04] <pfischer>	 BTW: there’s a bug in the SUP regarding weighted tags: We do not write them at all.
[14:07:19] <ebernhardson>	 :(  Good to find now at least
[14:09:22] <pfischer>	 ebernhardson: topics should exist in codfw now
[14:11:11] <ebernhardson>	 pfischer: yup i see them, thanks!
[14:11:29] <pfischer>	 Yeah, when decoding update events, we erase the weighted tags raw-field and store it in a dedicated property of the update event. When mapping to the ES request, we only process redirects separately and assume that everything else comes as part of the raw fields.
[14:12:37] <ebernhardson>	 wow, thats an oversight.  I somehow suspect i'm to blame for that :P
[14:12:49] <ebernhardson>	 (not thats who is particularly important)
[14:13:19] <pfischer>	 …and anyone who reviewed (which includes myself)
[14:14:13] <ebernhardson>	 i suppose as long as we get some testing in that covers the case and makes it hard to re-break, should be good
[14:17:41] <pfischer>	 I’ll at least fix that. Regarding the stats of # incoming recommendation vs # outgoing TAG_UPDATE events: There is quite a gap: of 4h worth of ~8k recommendations only 700 have a matching TAG_UPDATE. At least that’s my first calculation.
[14:19:49] <ebernhardson>	 unrelated but nifty, multiple reverts in gerrit no longer does `Revert Revert Revert Revert ...`.  Now its `Revert^4 ...`
[14:22:11] <ebernhardson>	 pfischer: ouch, so we are losing them at both sides of the pipeline
[14:34:51] <inflatador>	 ryankemper I disabled all the WDQS-related requestctl rules as we discussed on Tues...will keep an eye on CODFW but hoping the storm has passed
[15:09:30] <pfischer>	 ebernhardson: yes, I’am afraid so. Here’s the first fix: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/155
[15:15:57] <inflatador>	 pfischer ebernhardson no rush, but if y'all wanna look at the pool counters dashboard, David and I added some new panels that use prom instead of graphite. Let us know if the new stats look reasonable https://grafana-rw.wikimedia.org/d/qrOStmdGk/elasticsearch-pool-counters?orgId=1
[15:31:00] <ebernhardson>	 staging looks to have been running fine with private wikis overnight, turning it on for the prod instances
[15:41:27] <ottomata>	 ebernhardson: auto topic creation is on, so if a topic is produced to, it will be created :)
[15:41:45] <ottomata>	 (as long as you don't need special topic settings)
[15:42:39] <ebernhardson>	 ottomata: except we aren't producing to it yet, we are starting the consumers first. But the flink consumer fails to start when the topic doesn't exist
[15:43:35] <ebernhardson>	 next steps will be to migrate use cases from the previous topic-per-use-case to the new generalized topic
[15:44:53] <ottomata>	 ah k
[15:45:24] <ebernhardson>	 ottomata: quasi realtedly, how bad is it to change partitioning? We've realized that having 5 partitions for one of our topics is a kinda bad, because it's a prime number and we can't have an even distribution of work
[15:45:47] <ottomata>	 its only bad if you care about keys going to the same consumer
[15:46:13] <ottomata>	 adding partiions will cause key partitioner to send a key to a different partition than before
[15:46:27] <ottomata>	 otherwise (if you are just doing partitions for scale), it won't matter
[15:46:36] <ebernhardson>	 hmm, i think that will be ok but i'll have to check.  
[15:46:53] <ottomata>	 kafka consumer should rebalance automatically...not 100% sure what flink consumer does though
[15:47:46] <ebernhardson>	 it's just annoying from our side because we will have some of the partitions keep up, but 2 of the partitions lag because they are being processed by the same flink worker
[15:48:46] <ottomata>	 aye, you can't just have 5 flink workers?
[15:49:14] <ottomata>	 oh you can't reduce partitions i think. only increase
[15:50:02] <ebernhardson>	 in this case we don't have long term state, so the app can be destroyed and re-deployed. perhaps we could match them. I want to say we pondered 5 and rejected it for some reason but i have to look back to find why
[15:51:04] <ebernhardson>	 it's usually not a problem, i happened to think of it because i just deployed private wikis and it's churning through the backlog since it started from beginning. 3 partitions caught up but 2 are still climbing in lag
[15:51:28] <ottomata>	 aye
[16:01:13] <ebernhardson>	 ahh, we didn't do 5 because it seemed like a lot of unnecessary resources. We have 2 workers right now, one gets 2 partitions one gets 3, basically 50% more work on one of the two. 6 partitions would let it split evenly and allow scaling from 2->3->6 as needed in the future
[17:06:03] <inflatador>	 LMK if/when you'd like to scale up. I'm a big fan of throwing hardware at the problem ;)
[17:13:03] <ebernhardson>	 inflatador: lol, i guess i wasn't really thinking of it as a hardware problem, but a prime number problem. It keeps up fine, but it looks like the instances running 2xpartitions took ~15 minutes to process the backlog, and the instances running 3x partitions took ~35 minutes to clear the same backlog. Perhaps it's just slightly annoying that they don't do the same thing :P
[17:16:50] <inflatador>	 Oh yeah, I should've read closer. that's more about partitions, was thinking purely in terms of containers
[17:17:00] <inflatador>	 still, bring on those composite numbers ;P
[17:17:48] <ebernhardson>	 i think we just change the topic partition count to 6 and call it a day, but i have to double check that we don't do anything special
[17:56:49] <inflatador>	 lunch, back in time for pairing
[17:56:59] * ebernhardson sighs at threads...writing a test that a thing in a daemon thread is doing the right thing requires extra synchronization that we didn't need for non-tests :P
[18:00:48] <pfischer>	 ebernhardson: threads in which context?
[18:00:59] <ebernhardson>	 pfischer: the backfill thread in the cirrus reindex orchestrator
[18:01:15] <ebernhardson>	 i'm trying to write a test that verifies it performs the final backfill before shutting down
[18:02:09] <ebernhardson>	 pretty sure i fixed the problem, was like a 3 line change, and can see it work when running. But verifying with a test isn't completely obvious
[18:02:22] <pfischer>	 Was/is that orchestrator written in python?
[18:02:26] <ebernhardson>	 yea
[18:02:42] <ebernhardson>	 for now i'm going with time.sleep(.1) and pretending that counts as synchronization :P
[18:02:51] <ebernhardson>	 but it doesn't feel right
[18:03:19] <pfischer>	 Heh, yeah blocking waits feel wrong in multi-threading scenarios
[18:56:50] <inflatador>	 working my way thru the Elastic percentiles dashboard . This is the first panel using the prom metrics (needs some polishing still ) https://grafana-rw.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles-wip-prom-metrics?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&var-datacenter=eqiad+prometheus%2Fops&var-site=eqiad&var-k8sds=eqiad+prometheus%2F
[18:56:50] <inflatador>	 k8s&viewPanel=18&from=1723143394476&to=1723748194476
[18:56:59] <inflatador>	 errr, try this https://grafana.wikimedia.org/goto/QKmYh2CSR?orgId=1
[18:57:06] <ebernhardson>	 love modern urls :)
[18:58:50] <inflatador>	 LOL. Still need to get a handle on the variables, there are probably too many
[18:59:34] <ebernhardson>	 yea we should probably try to reduce the variable count if possible, i'm sure not all variables apply to all graphs, maybe split out into different dashboards?
[18:59:53] <ebernhardson>	 This dashboard in particular, despite the name, is basically the "anything that ever went wrong" dashboard
[19:01:21] <inflatador>	 Ah, good call...I should probably add some general dashboard cleanup to the AC on the task
[19:01:24] <ebernhardson>	 inflatador: one thought, with prometheus metrics split between eqiad and codfw prom collectors, i wonder if we should have the dashboard against thanos?
[19:01:57] <ebernhardson>	 i suppose i'm surprised that codfw is so low, since codfw should be running requests, but my suspicion is it's low because most requests against codfw are recorded in the other prom instance
[19:03:57] <ebernhardson>	 the codfw requests listed here are probably only write requests that go from eqiad->codfw, which should be pretty rare with SUP doing most of that work. Should only be archive updates and weighted tags
[19:06:01] <inflatador>	 ebernhardson we could do that, is there any downside to using Thanos?
[19:07:52] <inflatador>	 ` Thanos keeps 54 weeks of raw metric data, and 5 years for 5m and 1h resolution under normal circumstances. If there's object storage space pressure both raw metric data retention and 5m resolution might be shortened. `
[19:08:06] <inflatador>	 that sounds good enough for our purposes?
[19:12:40] <ebernhardson>	 inflatador: i think the difference is resolution, thanos has 5m instead of 1m? might also be slightly behind in ingestion and slightly slower to query, but probably not meaningful amounts
[19:13:49] <ebernhardson>	 random unrelated idea, our next team shirt could be:  I � Unicode
[19:16:06] <inflatador>	 LOL, I was having fun in my home lab yesterday..apparently Raspbian doesn't load/enable en_US.UTF-8 by default?
[19:17:01] <ebernhardson>	 does it just do plain ascii?
[19:19:28] <ebernhardson>	 btw the � is intentionally the unicode ? with a diamond, from the specials section. regularly seen when rendering is broken. i suppose tofu would work as well for the purposes
[19:19:54] <inflatador>	 I recognize that diamond ;)
[19:20:20] <inflatador>	 and yeah, it looks like none of the LC vars are set by default
[19:21:46] <inflatador>	 I don't know enough about locale, esp. the Debian implementation. Vaguely remember setting it for customers a long time ago
[19:32:27] <ebernhardson>	 i think i've only ever set it when running the interactive installer :P
[19:37:00] <inflatador>	 ebernhardson did you edit the p95 panel on the new dashboard? Just wondering as it's telling me "someone else has updated this dashboard"
[19:37:11] <ebernhardson>	 inflatador: nope, i haven't edited anything
[19:37:27] <inflatador>	 ACK, probably me then
[19:43:09] <inflatador>	 new P95 (https://grafana.wikimedia.org/goto/ar5NJhjIR?orgId=1 ) looks wildly different than old P95 (https://grafana.wikimedia.org/goto/X8FDJhjIg?orgId=1 ) ,trying to figure out why
[19:46:54] <ebernhardson>	 looking
[19:50:33] <ebernhardson>	 wouldn't say it's wildly different, but indeed the new numbers are perhaps 20% higher than the old ones. It could simply be an artifact of data collection
[19:51:40] <ebernhardson>	 in the old one everything was aggregated into a single bucket and statsd would spit out a p95 every minute of what it had seen, the prometheus one is doing query-time aggregation of per-pod data
[19:53:31] <ebernhardson>	 the per pod data is all over the place, not sure what that means 
[19:57:11] <ebernhardson>	 quasi relatedly, i wonder if there is benefit to having breakdowns by kubernetes_namespace,  as they are doing wildly different things (mw-api-int (sup), mw-api-ext (web), mw-jobrunner (updates), mw-web (web)
[19:57:42] <ebernhardson>	 but then there are too many lines :P
[20:04:27] <ebernhardson>	 inflatador: oh, do you just mean that it says 0.25ms, instead of 250ms? I guess i was mentally doing a *1000. Thats just units adjustment, but the spikes seem to line up between the two
[20:06:53] <inflatador>	 ah good point, let me see if I can change the unit display
[20:09:11] <inflatador>	 it says milliseconds already, but I guess we're dividing or multiplying by 1000 somewhere
[20:11:06] <ebernhardson>	 inflatador: i think the base data in prom is by second
[20:11:43] <ebernhardson>	 the buckets are 0.05, 0.1, ..., 0.5, 1.0, i think those are counts based on the # of seconds the request took
[20:16:16] <ebernhardson>	 the top answer here seems a reasonable explanation of whats going on: https://stackoverflow.com/questions/55162093/understanding-histogram-quantile-based-on-rate-in-prometheus
[20:16:56] <ebernhardson>	 also seems to explain why the data in prometheus might not be quite the same as the statsd data was (statsd was more accurate, where prometheus is doing some math-i-magic that introduces error via assumptions of linearity)
[20:17:44] <dr0ptp4kt>	 inflatador: i realized i didn't get back in touch with you about T372442 . i looked at the logs a little, but still don't really know. ebernhardson was, if i interpreted correctly, suggesting maybe trying to line up the graphs of the db lag and the locked up nodes. i won't have time to look more today, just wanted to acknowledge it's a thing.
[20:17:44] <stashbot>	 T372442: Determine if WDQS was affected by wikipedia editing outage/consider protections in similiar future scenarios - https://phabricator.wikimedia.org/T372442
[20:22:41] <inflatador>	 dr0ptp4kt cool, thanks for looking. I do know that we've at least 1 MySQL incident before that one and 1 after it . I don't know the relative severity of them
[20:23:45] <inflatador>	 but WDQS only blew up once, so that suggests they're not related I guess
[20:24:51] <ebernhardson>	 not just db lag, but when mediawiki was having trouble responding.  Since the alerts that were firing were about php-fpm worker saturation, maybe align with: https://grafana.wikimedia.org/d/U7JT--knk/mediawiki-on-k8s?orgId=1&viewPanel=84&from=1723546135191&to=1723692202636
[20:25:27] <ebernhardson>	 yea i tend to agree, seems like it was a coincidence and not about mwapi
[20:28:06] <inflatador>	 I'll update the ticket to make that a little more clear. dr0ptp4kt if it's OK w/you we can hold off on further investigation until WDQS blows up again
[20:28:22] <dr0ptp4kt>	 heh, sgtm inflatador 
[20:53:24] <ebernhardson>	 private wikis are no longer being written from php-land, made an edit in office wiki and can see things working as expected
[21:38:18] <pfischer>	 ebernhardson: Wohoo! That’s awesome!
[21:47:16] <inflatador>	 {◕ ◡ ◕}
[21:48:23] <dr0ptp4kt>	 nice ebernhardson 
[21:49:19] <dr0ptp4kt>	 okay, i emailed the team mailing list with the draft of the search backend email. your reviews are appreciated.