[00:19:17] <ebernhardson>	 it's still regularly restarting :S
[00:24:12] <ebernhardson>	 basically https://phabricator.wikimedia.org/P60544
[00:38:09] * ebernhardson somehow hadn't considered that saneitizer will now subject us to the gamut of broken pages :P
[06:38:04] <volans>	 just FYI spicerack upgraded on cumin1002 too
[08:18:41] <gehel>	 dcausse: I'm pushing our 1:1 back by 30'
[08:20:26] <dcausse>	 np
[09:30:25] <gehel>	 dcausse: I'll be 3' late
[09:30:30] <dcausse>	 ok
[10:17:53] <dcausse>	 lunch
[13:54:57] <pfischer>	 dcausse: Since Erik rolled out the SUP + Sanetizer last night, it fails because fetch_failure contains a field `raw_event` which holds a JSON-encoded version of an UpdateEvent. However, encoding complies with the schema linked to the update stream, which does not yet support PAGE_RERENDER_UPSERT as enum constant for change_type. Currently PAGE_RERENDER_UPSERT is used only inside the consumer. Now I’m wondering if we 
[13:54:57] <pfischer>	 should expand the schema or live with a workaround: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/119. Additionally we could rename the constant to start with an underscore to hint its internal nature. Or would you prefer this to be a public constant?
[14:00:53] <dcausse>	 pfischer: wondering why it's not failing earlier?
[14:01:35] <dcausse>	 https://schema.wikimedia.org/repositories/primary/jsonschema/development/cirrussearch/update_pipeline/update/current.yaml should be adapted no?
[14:02:16] <pfischer>	 That would be the transparent way. On the other hand, the producer does not create UpdateEvents with changeType = PAGE_RERENDER_UPSERT
[14:02:57] <pfischer>	 > why is it not failing earlier?
[14:03:30] <dcausse>	 oh it's because we use the "fields" field of the update_event schema to apply flink serialization?
[14:03:33] <pfischer>	 The consumer does not encode UpdateEvents, only to render raw_event to write to fetch_failure
[14:06:39] <dcausse>	 hm... there could be 3 options I guess? the hack that you suggest but it might not be ideal to keep it for long, 2/ update the schema, 3/ another hack: filter PAGE_RERENDER_UPSERT from fetch_failures 
[14:07:37] <dcausse>	 I think ultimately option 2 is what we should target, seems cleaner and less error prone?
[14:08:38] <dcausse>	 the confusion is probably that we use this update_event schema for several purposes
[14:11:11] <pfischer>	 Yes, it feels not completely right to have that additional enum constant. However, alternatives might involve an additional flag `upsert` which in turn does not make sense in combination with a pure tag-update.
[14:12:50] <pfischer>	 Or we could make the sink-converter smart and hope we can detect PAGE_RERENDER + upsert semantics from UpdateEvents due to other missing meta information. But that seems rather flaky
[14:13:09] <pfischer>	 So I guess you are right, the cleanest way is to update the schema.
[14:13:33] <dcausse>	 yes, I would go for promoting PAGE_RERENDER_UPSERT as part of the cirrussearch.update_pipeline.update schema with possibly a hack to unblock the pipeline quickly
[14:14:06] <pfischer>	 Hm, then again: We have a field in UpdateEvent.meta `stream` which we could fill with “saneitize”
[14:14:50] <pfischer>	 That would allow us to detect the necessary upsert behaviour at sink-time.
[14:16:04] <dcausse>	 could be an option indeed, but feels slightly surprising?
[14:19:32] <pfischer>	 Yes, would have to be documented. PAGE_RERENDER_UPSERT mixes two domains. All other constants so far describe the kind of source-UpdateEvent but _UPSERT is behaviour related to ES/OS. The sink already does interpret UpdateEvents (and maps REV_BASED_UPDATE to upsert semantic).
[14:24:08] <dcausse>	 PAGE_RERENDER_UPSERT semantic is "fetch the latest revision and upsert" if it has possibly some value outside of the saneitizer perhaps it's interesting to keep it as a public enum constant if not perhaps some convention on meta.stream == 'saneitize' is OK
[14:25:22] <dcausse>	 but the process that might read the error stream might have to be aware of this convention
[14:31:15] <dcausse>	 I'd be leaning toward promoting PAGE_RERENDER_UPSERT (just because it sounds a bit more natural quickly thinking about the problem) but I don't have strong arguments against the other solutions you suggest, perhaps Erik has more opinions on this? 
[14:44:09] <pfischer>	 Sure, I’ll prepare the schema update, after all the schema still lives under development/
[14:51:27] <dcausse>	 yes, definitely, now is the best moment to adapt/break it to feet our needs
[14:53:03] <inflatador>	 <o/
[14:54:05] <pfischer>	 o/
[14:55:17] <dcausse>	 o/
[15:00:02] <ebernhardson>	 \o
[15:00:57] <dcausse>	 o/
[15:01:29] <dcausse>	 not finding a great way to filter taskmanager logs from jobmanager ones in logstash :/
[15:03:10] <ebernhardson>	 i'm perhaps guilty of mostly reading taskmanager logs straight from kubectl (piped into fblog), but that doesn't have old stuff 
[15:03:21] <dcausse>	 using orchestrator.resource.name:flink-app-wikidata-taskmanager* but not a big fans of wildcards
[15:04:04] <dcausse>	 yes these pods get renewed quite often :/
[15:48:08] <ebernhardson>	 shipping peters update to sup
[15:57:38] * ebernhardson wonders if the connection pool usage metric could somehow report the peak pool usage over last minute, instead of the usage at that instant
[15:58:03] <ebernhardson>	 i suppose that would also be misleading, but in a different way :P
[16:00:44] <pfischer>	 Yeah, either way, the instant is misleading too
[16:02:05] <ebernhardson>	 i don't see anything failing, looks to have been the last bit to get saneitizer started working
[16:02:19] <ebernhardson>	 still need to make up some dashboards, and turn off the cirrus side scheduler
[16:02:56] <pfischer>	 Awesome! Using a meter as metric, we would get multiple args for different windows
[16:03:32] <dcausse>	 nice!
[16:05:50] <ebernhardson>	 hmm, i'm seeing things marked cloudelastic.fixed, but not cloudelastic.old
[16:06:06] <ebernhardson>	 not sure if thats data collection or if it's not wired up right, will poke around
[16:14:48] <ebernhardson>	 sigh...I'm sure i tested this but sure enough, but getting old doc's even if i ask it to mark every other doc as old
[16:14:53] <ebernhardson>	 s/getting/not getting/
[16:36:41] * ebernhardson apparently did not test it ... because the problem is it treats sequenceid=0 as false, and passing a 1 passes arguments in the wrong place
[16:42:06] <dcausse>	 when mw parses the uri arg?
[16:43:39] <ebernhardson>	 dcausse: nah, in our own code. null checking with == instead of ===
[16:43:51] <ebernhardson>	 apparently I'm not writing enough php lately
[16:44:30] <ebernhardson>	 https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1020288
[16:46:10] <dcausse>	 curious that phan did not catch the missing param
[16:46:53] <ebernhardson>	 indeed, i suppose the return types aren't  explicitly documented, but the detection should have worked here i imagine
[16:47:49] <dcausse>	 might have seen the first return null from makeIsOldClosure perhaps and called it good enough
[16:59:39] <ebernhardson>	 hmm, yea i do remember something about phan being more-permissible. As long as some types match, instead of requiring all possible types to match. maybe
[17:08:43] <dcausse>	 always a bit worried to see how the new wdqs hosts perform badly compared to older ones (when writing to blazegraph), older hosts ingesting 2.5k triples/sec but new ones around 1.7k/s :( 
[17:10:46] <dcausse>	 heading out
[17:11:19] <inflatador>	 yeah, it sucks! The next config up is almost double the $$$$ too ;(
[17:30:15] <inflatador>	 lunch, back in time for pairing
[17:58:23] <inflatador>	 back
[18:28:11] <ryankemper>	 5m late to pairing
[18:30:52] <inflatador>	 ACK
[18:33:59] <inflatador>	 gehel we're in pairing if you wanna join
[18:34:14] <inflatador>	 or at least, I am ;)
[18:34:58] <gehel>	 Sorry, I won't make it today 
[18:35:26] <inflatador>	 ACK, np
[19:31:42] <ebernhardson>	 sigh, looks like we dropped buster from mirrors.wikimedia.org, but mwcli still uses numerous buster images that don't have obvious replacements with new versions
[19:32:13] <ebernhardson>	 i guess i can derive hacked images that use official mirrors, but a bit tedious
[19:34:22] <taavi>	 mirrors.wikimedia.org *is* an official mirror :-)
[19:35:20] <taavi>	 what was dropped was the buster-backports suite, that's no longer included on the newest wikimedia /buster image, https://phabricator.wikimedia.org/T362518
[19:40:50] <ebernhardson>	 taavi: oh indeed, thanks for pointing that out. I wonder if these need those for anything..
[19:53:35] <inflatador>	 quick workout, back in ~30
[20:06:26] * ebernhardson hacks around with grep -v buster-backports for the moment
[20:44:05] <inflatador>	 back
[21:15:32] <inflatador>	 ryankemper tagged you on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1020375 (putting 2088 back in prod) if you have time to look
[21:40:13] <pfischer>	 ebernhardson: seems like we regularly run into OOMs due to running out of meatspace (off-heap class-loading metadata). Looking at the flink dashboard this appears to be 256mb at the moment. The overall memory saturation is at 75% (of 3gb). Shall we simply increase `taskmanager.memory.jvm-metaspace.size` for now? We might have to investigate potential leaks if that still does not solve it.
[22:00:20] <ebernhardson>	 pfischer: hmm, yea i suppose we need to.
[22:05:32] <pfischer>	 Done. Let’s see how it works over night.
[22:46:04] <pfischer>	 ebernhardson: You mentioned PoolCounter on Monday as an alternative to a zookeeper-based semaphore. I read the documentation, but a few questions remain. As far as I understand, we’d have to run a plain-vanilla Redis or its pool counter mod as server, which would have to live besides the job- and task-managers. Then we’d have to implement the TCP client in java. The client would then request locks for every request it 
[22:46:04] <pfischer>	 makes. This way we could only limit the number of requests in-flight but not the rate. So the question is: How to establish a constant rate across parallel insances?
[22:46:44] <pfischer>	 All I can think of right now is this: At start, all instances fire at max_rate/replicas.
[22:48:57] <pfischer>	 Then we use pool counter to request a lock for burst mode, that is: If an instance starts building up back-pressure, it would request a burst lock. With that it would temporarily double its rate, for a limited time, e.g. 1min.