[00:19:17] it's still regularly restarting :S
[00:24:12] basically https://phabricator.wikimedia.org/P60544
[00:38:09] * ebernhardson somehow hadn't considered that saneitizer will now subject us to the gamut of broken pages :P
[06:38:04] just FYI spicerack upgraded on cumin1002 too
[08:18:41] dcausse: I'm pushing our 1:1 back by 30'
[08:20:26] np
[09:30:25] dcausse: I'll be 3' late
[09:30:30] ok
[10:17:53] lunch
[13:54:57] dcausse: Since Erik rolled out the SUP + saneitizer last night, it fails because fetch_failure contains a field `raw_event` which holds a JSON-encoded version of an UpdateEvent. However, the encoding has to comply with the schema linked to the update stream, and that schema does not yet support PAGE_RERENDER_UPSERT as an enum constant for change_type. Currently PAGE_RERENDER_UPSERT is used only inside the consumer. Now I'm wondering if we should expand the schema or live with a workaround: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/119. Additionally we could rename the constant to start with an underscore to hint at its internal nature. Or would you prefer this to be a public constant?
[14:00:53] pfischer: wondering why it's not failing earlier?
[14:01:35] https://schema.wikimedia.org/repositories/primary/jsonschema/development/cirrussearch/update_pipeline/update/current.yaml should be adapted, no?
[14:02:16] That would be the transparent way. On the other hand, the producer does not create UpdateEvents with changeType = PAGE_RERENDER_UPSERT
[14:02:57] > why is it not failing earlier?
[14:03:30] oh, it's because we use the "fields" field of the update_event schema to apply flink serialization?
[14:03:33] The consumer does not otherwise encode UpdateEvents; it only renders raw_event to write to fetch_failure
[14:06:39] hm... there could be 3 options I guess? 1/ the hack that you suggest (but it might not be ideal to keep for long), 2/ update the schema, 3/ another hack: filter PAGE_RERENDER_UPSERT from fetch_failures
[14:07:37] I think ultimately option 2 is what we should target, seems cleaner and less error-prone?
[14:08:38] the confusion is probably that we use this update_event schema for several purposes
[14:11:11] Yes, it doesn't feel completely right to have that additional enum constant. However, alternatives might involve an additional flag `upsert`, which in turn does not make sense in combination with a pure tag-update.
[14:12:50] Or we could make the sink-converter smart and hope we can detect PAGE_RERENDER + upsert semantics from other (missing) meta information on the UpdateEvent. But that seems rather flaky.
[14:13:09] So I guess you are right, the cleanest way is to update the schema.
[14:13:33] yes, I would go for promoting PAGE_RERENDER_UPSERT as part of the cirrussearch.update_pipeline.update schema, with possibly a hack to unblock the pipeline quickly
[14:14:06] Hm, then again: we have a field in UpdateEvent.meta, `stream`, which we could fill with "saneitize"
[14:14:50] That would allow us to detect the necessary upsert behaviour at sink-time.
[14:16:04] could be an option indeed, but feels slightly surprising?
[14:19:32] Yes, it would have to be documented. PAGE_RERENDER_UPSERT mixes two domains: all other constants so far describe the kind of source-UpdateEvent, but _UPSERT is behaviour related to ES/OS. The sink already interprets UpdateEvents (and maps REV_BASED_UPDATE to upsert semantics).
[14:24:08] PAGE_RERENDER_UPSERT semantics are "fetch the latest revision and upsert". If it possibly has some value outside of the saneitizer, perhaps it's interesting to keep it as a public enum constant; if not, perhaps some convention on meta.stream == 'saneitize' is OK
[14:25:22] but any process that reads the error stream might have to be aware of this convention
[14:31:15] I'd be leaning toward promoting PAGE_RERENDER_UPSERT (just because it sounds a bit more natural quickly thinking about the problem), but I don't have strong arguments against the other solutions you suggest. Perhaps Erik has more opinions on this?
[14:44:09] Sure, I'll prepare the schema update; after all, the schema still lives under development/
[14:51:27] yes, definitely, now is the best moment to adapt/break it to fit our needs
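(For context, a rough sketch of what option 2 plus the sink-side mapping discussed above could look like. This is illustrative only: `ChangeType`, `isUpsert`, and the constants not mentioned in the chat are assumed names, not the actual cirrus-streaming-updater code; only REV_BASED_UPDATE, PAGE_RERENDER, and PAGE_RERENDER_UPSERT come from the discussion.)

```java
// Hypothetical sketch; the real enum lives in cirrus-streaming-updater and
// the cirrussearch.update_pipeline.update schema. Constants not named in
// the chat (TAGS_UPDATE, PAGE_DELETE) are assumptions for illustration.
public enum ChangeType {
    REV_BASED_UPDATE,     // already mapped to upsert semantics by the sink
    PAGE_RERENDER,        // plain re-render update
    PAGE_RERENDER_UPSERT, // saneitizer: "fetch the latest revision and upsert"
    TAGS_UPDATE,          // assumed: pure tag-update, no upsert flag possible
    PAGE_DELETE           // assumed
}

// Sink-side interpretation as described at 14:19: the sink already maps
// REV_BASED_UPDATE to upserts, so promoting the new constant keeps this
// mapping in one place instead of relying on a meta.stream convention.
final class SinkSemantics {
    static boolean isUpsert(ChangeType changeType) {
        switch (changeType) {
            case REV_BASED_UPDATE:
            case PAGE_RERENDER_UPSERT:
                return true;
            default:
                return false;
        }
    }
}
```

Promoting the constant into the public schema (option 2) also keeps any external consumer of the error stream decoupled from a meta.stream == 'saneitize' convention, which matches the 14:25 concern.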
[14:53:03] o/
[14:55:17] o/
[15:00:02] \o
[15:00:57] o/
[15:01:29] not finding a great way to filter taskmanager logs from jobmanager ones in logstash :/
[15:03:10] i'm perhaps guilty of mostly reading taskmanager logs straight from kubectl (piped into fblog), but that doesn't have old stuff
[15:03:21] using orchestrator.resource.name:flink-app-wikidata-taskmanager* but not a big fan of wildcards
[15:04:04] yes, these pods get renewed quite often :/
[15:48:08] shipping peter's update to sup
[15:57:38] * ebernhardson wonders if the connection pool usage metric could somehow report the peak pool usage over the last minute, instead of the usage at that instant
[15:58:03] i suppose that would also be misleading, but in a different way :P
[16:00:44] Yeah, either way, the instant is misleading too
[16:02:05] i don't see anything failing; that looks to have been the last bit needed to get the saneitizer working
[16:02:19] still need to make up some dashboards, and turn off the cirrus-side scheduler
[16:02:56] Awesome! Using a meter as the metric, we would get multiple averages for different windows
[16:03:32] nice!
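(An illustration of the 15:57 idea: a gauge that reports the peak pool usage over a trailing window rather than the instantaneous value. This is a hedged sketch, not the SUP implementation; the wiring to Flink's MetricGroup and to the HTTP client's pool is assumed. A Dropwizard-style Meter, as suggested at 16:02, would instead give exponentially weighted 1/5/15-minute averages, though it measures an event rate rather than a level like pool saturation.)

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: track the peak number of leased pool connections within a
// tumbling one-minute window; getValue() is what a Flink Gauge would report.
final class PeakPoolUsageGauge {
    private final long windowMillis;
    private final AtomicInteger inUse = new AtomicInteger();
    private volatile int peak;
    private volatile long windowStart = System.currentTimeMillis();

    PeakPoolUsageGauge(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    void onLease() { // call when a connection is taken from the pool
        int now = inUse.incrementAndGet();
        if (now > peak) {
            peak = now; // benign race; a slightly stale peak is fine for a metric
        }
    }

    void onRelease() { // call when a connection is returned
        inUse.decrementAndGet();
    }

    int getValue() { // e.g. registered via metricGroup.gauge("poolPeak", this::getValue)
        long now = System.currentTimeMillis();
        if (now - windowStart >= windowMillis) {
            windowStart = now;  // roll over to a new window
            peak = inUse.get(); // restart the peak from current usage
        }
        return peak;
    }
}
```

As noted at 15:58, this is still misleading in its own way: the reported peak lags by up to a full window, and a tumbling window drops the previous peak abruptly at each rollover.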
[16:05:50] hmm, i'm seeing things marked cloudelastic.fixed, but not cloudelastic.old
[16:06:06] not sure if that's data collection or if it's not wired up right, will poke around
[16:14:48] sigh... I'm sure i tested this, but sure enough, getting old docs even if i ask it to mark every other doc as old
[16:14:53] s/getting/not getting/
[16:36:41] * ebernhardson apparently did not test it... because the problem is it treats sequenceid=0 as false, and passing a 1 passes arguments in the wrong place
[16:42:06] when mw parses the uri arg?
[16:43:39] dcausse: nah, in our own code. null checking with == instead of ===
[16:43:51] apparently I'm not writing enough php lately
[16:44:30] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1020288
[16:46:10] curious that phan did not catch the missing param
[16:46:53] indeed, i suppose the return types aren't explicitly documented, but the detection should have worked here i imagine
[16:47:49] it might have seen the first `return null` from makeIsOldClosure and called it good enough
[16:59:39] hmm, yea i do remember something about phan being more permissive: as long as some types match, instead of requiring all possible types to match. maybe
[17:08:43] always a bit worrying to see how badly the new wdqs hosts perform compared to older ones (when writing to blazegraph): older hosts ingest 2.5k triples/sec but new ones around 1.7k/s :(
[17:10:46] heading out
[17:11:19] yeah, it sucks! The next config up is almost double the $$$$ too ;(
[17:30:15] lunch, back in time for pairing
[17:58:23] back
[18:28:11] 5m late to pairing
[18:30:52] ACK
[18:33:59] gehel: we're in pairing if you wanna join
[18:34:14] or at least, I am ;)
[18:34:58] Sorry, I won't make it today
[18:35:26] ACK, np
[19:31:42] sigh, looks like we dropped buster from mirrors.wikimedia.org, but mwcli still uses numerous buster images that don't have obvious replacements with new versions
[19:32:13] i guess i can derive hacked images that use official mirrors, but a bit tedious
[19:34:22] mirrors.wikimedia.org *is* an official mirror :-)
[19:35:20] what was dropped was the buster-backports suite; that's no longer included in the newest wikimedia/buster image, https://phabricator.wikimedia.org/T362518
[19:40:50] taavi: oh indeed, thanks for pointing that out. I wonder if these need those for anything..
[19:53:35] quick workout, back in ~30
[20:06:26] * ebernhardson hacks around with grep -v buster-backports for the moment
[20:44:05] back
[21:15:32] ryankemper: tagged you on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1020375 (putting 2088 back in prod) if you have time to look
[21:40:13] ebernhardson: seems like we regularly run into OOMs due to running out of metaspace (off-heap class-loading metadata). Looking at the flink dashboard this appears to be 256mb at the moment. The overall memory saturation is at 75% (of 3gb). Shall we simply increase `taskmanager.memory.jvm-metaspace.size` for now? We might have to investigate potential leaks if that still does not solve it.
[22:00:20] pfischer: hmm, yea i suppose we need to.
[22:05:32] Done. Let's see how it works overnight.
[22:46:04] ebernhardson: You mentioned PoolCounter on Monday as an alternative to a zookeeper-based semaphore. I read the documentation, but a few questions remain. As far as I understand, we'd have to run a plain-vanilla Redis or its PoolCounter mod as the server, which would have to live alongside the job- and task-managers. Then we'd have to implement the TCP client in java. The client would then request locks for every request it makes. This way we could only limit the number of requests in flight, but not the rate. So the question is: how do we establish a constant rate across parallel instances?
[22:46:44] All I can think of right now is this: at start, all instances fire at max_rate/replicas.
[22:48:57] Then we use PoolCounter to request a lock for burst mode, that is: if an instance starts building up back-pressure, it would request a burst lock. With that it would temporarily double its rate, for a limited time, e.g. 1 min.
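(On the 21:40 metaspace OOMs: a minimal sketch of the config change, assuming Flink's standard TaskManagerOptions; in the SUP deployment this would presumably go through the operator's flinkConfiguration block rather than code, and 512m is an assumed value, not what was actually deployed.)

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.MemorySize;
import org.apache.flink.configuration.TaskManagerOptions;

public final class MetaspaceBump {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Same key as taskmanager.memory.jvm-metaspace.size in flink-conf.yaml;
        // raises the JVM's off-heap class-metadata limit for task managers.
        conf.set(TaskManagerOptions.JVM_METASPACE, MemorySize.parse("512m"));
        System.out.println(conf.get(TaskManagerOptions.JVM_METASPACE));
    }
}
```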
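(And a sketch of the 22:46 burst idea, under heavy assumptions: `PoolCounterClient` is hypothetical — the TCP client would still have to be written — and PoolCounter itself only bounds concurrent lock holders, so the actual rate shaping stays local, here via Guava's RateLimiter, while PoolCounter merely limits how many instances may burst at once.)

```java
import com.google.common.util.concurrent.RateLimiter;
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of the proposed scheme; class, method, and pool-key
// names are invented for illustration, not taken from cirrus-streaming-updater.
// Not thread-safe: one instance per source subtask is assumed.
final class BurstingRateLimiter {

    /** Assumed abstraction over the PoolCounter TCP protocol (ACQ4ME/RELEASE). */
    interface PoolCounterClient {
        boolean tryAcquire(String poolKey); // true if a burst slot was granted
        void release(String poolKey);
    }

    private static final String BURST_POOL = "sup-fetch-burst"; // assumed key
    private final RateLimiter limiter;
    private final double baseRate;
    private final PoolCounterClient poolCounter;
    private final Duration burstDuration = Duration.ofMinutes(1);
    private Instant burstUntil = Instant.MIN;
    private boolean bursting = false;

    BurstingRateLimiter(double maxRate, int replicas, PoolCounterClient pc) {
        // every instance starts at max_rate / replicas, per the 22:46 message
        this.baseRate = maxRate / replicas;
        this.limiter = RateLimiter.create(baseRate);
        this.poolCounter = pc;
    }

    /** Call before each fetch; blocks until the local token bucket has a permit. */
    void acquirePermit(boolean backPressureDetected) {
        Instant now = Instant.now();
        if (bursting && now.isAfter(burstUntil)) {
            limiter.setRate(baseRate);       // burst window elapsed: back to base
            poolCounter.release(BURST_POOL); // free the slot for other instances
            bursting = false;
        }
        if (backPressureDetected && !bursting && poolCounter.tryAcquire(BURST_POOL)) {
            limiter.setRate(baseRate * 2);   // temporarily double the rate
            burstUntil = now.plus(burstDuration);
            bursting = true;
        }
        limiter.acquire();
    }
}
```

One property worth noting: because the burst slot is released only when the holder's window expires, the PoolCounter pool size directly caps the aggregate overshoot (pool_size × baseRate extra requests per second at worst), which keeps the cluster-wide rate bounded even though each limiter is purely local.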