[00:01:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[00:11:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[02:03:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[02:08:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[02:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[04:29:41] PROBLEM - Check unit status of monitor_refine_event_sanitized_analytics_immediate on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event_sanitized_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:13:01] (03CR) 10AKhatun: [C: 03+1] "Looks great. Can't wait for the spark3 upgrade!" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/747508 (https://phabricator.wikimedia.org/T258834) (owner: 10Joal)
[06:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[07:59:09] 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, 10Event-Platform, 10Observability-Alerting: Apparent latency warning in 90th centile of eventgate-logging-external - https://phabricator.wikimedia.org/T294911 (10Joe) @BTullis I'm not fully convinced that slowness in connection to the edge would...
[08:13:05] PROBLEM - Check unit status of monitor_refine_event_sanitized_analytics_delayed on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event_sanitized_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:18:28] hello folks
[09:18:39] I created https://wikitech.wikimedia.org/wiki/User:Elukey/Misc/Bigtop to list basic commands to rebuild packages in bigtop
[09:56:40] 10Data-Engineering, 10Data-Engineering-Kanban, 10Generated Data Platform, 10Image-Suggestion-API, 10Image-Suggestions: Update HiveToCassandra for variable substitution and HQL from files loading - https://phabricator.wikimedia.org/T297934 (10JAllemandou)
[09:58:46] Thanks elukey for the write-up :)
[09:59:05] btullis: Am I right in assuming that we're not going to swap AQS clusters before the holidays?
[09:59:22] btullis: Oh and excuse me - Hi first :)
[10:03:32] 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban: Add Jenkins job for gobblin-wmf jar release to archiva - https://phabricator.wikimedia.org/T297938 (10JAllemandou)
[10:07:20] 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban: Update refinery gobblin jars to use new gobblin-wmf jars and update puppet gobblin jobs - https://phabricator.wikimedia.org/T297939 (10JAllemandou)
[10:10:59] Hi joal. Well it would be a shame not to, so I'm ready to pool aqs1010 now if you're happy for me to go ahead.
[10:11:11] sure btullis - let's try
[10:11:19] +1
[10:12:14] btullis: actually let's wait
[10:12:24] OK.
[10:13:01] btullis: we can do one host today, but next week almost all of us are off, and the week after next as well - if we wish to let the single host bake for some time, we won't have a slot to finalize the move
[10:13:22] With that, it's probably safer to do it after the holidays, would you agree?
[10:13:37] It's not that I wouldn't like to have it done before, just a matter of safety :)
[10:14:30] FYI I'm still here for a half day on Monday and all day Tuesday. I'm also on ops week until next Thursday.
[10:15:28] I agree that it's safer to wait, but I'd be keen to pool the test host even for a short while today to see if it handles traffic correctly. Then we can decide whether or not to depool it before end of play today.
[10:15:47] works for me btullis :)
[10:16:14] one host pooled for testing would be great, worst case scenario we depool it during the holidays if anything happens (I'll watch as well)
[10:16:31] Great! Thanks elukey.
[10:17:06] perfect - let's pool it and let it be for the holidays
[10:17:47] OK, here is the current state of aqs1010.
[10:17:52] https://www.irccloud.com/pastebin/rR1TmjBK/
[10:19:25] btullis: as we're getting closer and closer to deploy, we should also look into why we don't have logs for the new hosts in logstash - this will become very important
[10:20:13] joal: Ah yes, I saw that you created a ticket for that. Will pick it up and work on it today.
[10:20:28] Thank you btullis <3
[10:20:50] So I'm about to run `confctl pool dc=eqiad,cluster=aqs,name=aqs1010.eqiad.wmnet`
[10:20:51] Does that look right elukey?
[10:21:13] lemme check, I recall a different syntax
[10:22:17] sudo -i confctl select name=aqs1010.eqiad.wmnet set/pooled=yes
[10:22:26] this should be sufficient from puppetmaster1001
[10:22:46] ('get' works fine as expected)
[10:23:08] OK, thanks.
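For reference, a minimal sketch of the pool-and-check sequence discussed above, run from puppetmaster1001. The pool command is the one elukey gave; the final depool line is an assumption about what a rollback during the holidays would look like, not something run here.

```bash
# On puppetmaster1001: check the host's current conftool state first
sudo -i confctl select name=aqs1010.eqiad.wmnet get

# Pool the host into the aqs cluster
sudo -i confctl select name=aqs1010.eqiad.wmnet set/pooled=yes

# Verify the change took effect
sudo -i confctl select name=aqs1010.eqiad.wmnet get

# If anything misbehaves over the holidays, depooling is the reverse (assumption)
sudo -i confctl select name=aqs1010.eqiad.wmnet set/pooled=no
```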
[10:24:21] I'll also be doing a `tail -f /srv/log/aqs/syslog.log` and `tail -f /var/log/cassandra/system-a.log` and `tail -f /var/log/cassandra/system-b.log`
[10:25:09] and watching this graph for any 5xx errors: https://grafana.wikimedia.org/d/000000526/aqs?viewPanel=27&orgId=1
[10:26:35] Got a warning message, but is that normal?
[10:26:40] https://www.irccloud.com/pastebin/2OVfer0L/
[10:28:53] yeah, it's just to tell you that something was logged in the SAL, all good
[10:29:17] Thanks.
[10:30:47] 10Analytics-Clusters, 10Data-Engineering, 10Data-Engineering-Kanban, 10Cassandra, and 2 others: Switch over the Cassandra AQS cluster to the new hosts - https://phabricator.wikimedia.org/T297803 (10BTullis) We have pooled the first of the new hosts. ` btullis@puppetmaster1001:~$ sudo -i confctl select name...
[10:33:03] gehel: hello! would you have a minute for a java-dependency question?
[10:33:24] gehel: I have duplication between [stax:stax-api:1.0.1, xml-apis:xml-apis:1.4.01]
[10:33:34] Any preference on which one I should keep/exclude?
[10:33:55] good question
[10:34:26] xml-apis is younger (2011), StAX is from 2006 - but I don't know if that comes into play :S
[10:34:37] stax:stax-api was moved : https://mvnrepository.com/artifact/stax/stax-api
[10:35:16] OK, I assume that is the one I should exclude
[10:35:48] I would keep the newer one, so exclude stax
[10:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[10:36:11] Thanks gehel :)
[10:36:14] testing
[10:37:58] aqs1010 is looking good to me. I can't exactly see access logs, but I can see traffic between this host and various restbase servers if I do: `sudo tcpdump port 7232`
[10:40:55] 10Data-Engineering: Send cassandra3 (new hosts) logs to logstash - https://phabricator.wikimedia.org/T297460 (10BTullis) p:05Triage→03High a:03BTullis I'll have a look at this, because I agree that it would be really beneficial to get this sorted before completing the migration. Is it this particular dash...
[10:53:30] btullis: confirmed, on the host there is also `httpry`, which is nice for dumping HTTP traffic (basically a nicer tcpdump)
[10:55:53] Ooh, I haven't ever used that. Will check it out.
[11:01:42] 10Data-Engineering, 10Data-Engineering-Kanban, 10Infrastructure-Foundations, 10SRE, and 3 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10JAllemandou)
[11:04:58] 10Analytics-Clusters, 10Data-Engineering, 10Data-Engineering-Kanban, 10Cassandra, and 2 others: Switch over the Cassandra AQS cluster to the new hosts - https://phabricator.wikimedia.org/T297803 (10BTullis) We can see network traffic arriving at the host from restbase and being returned. Confirmed with `tc...
[11:06:22] Yep, I like `httpry` - How have I never used that before? I see we also have `ngrep` available, which is also really good.
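Putting the checks mentioned above together, a sketch of what to run on aqs1010 itself to confirm it is serving traffic after being pooled; each command is meant for its own terminal, and the network interface name passed to httpry is a placeholder.

```bash
# Watch the AQS service log and both Cassandra instance logs
tail -f /srv/log/aqs/syslog.log
tail -f /var/log/cassandra/system-a.log /var/log/cassandra/system-b.log

# Confirm restbase traffic is reaching the AQS port
sudo tcpdump port 7232

# Or, for readable HTTP request lines, httpry (interface name is a placeholder)
sudo httpry -i eth0 'port 7232'
```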
[11:11:14] More good news. I finally found out what I was doing wrong with the druid build yesterday. So I now have a new build which definitely has the right jars in it.
[11:11:19] https://www.irccloud.com/pastebin/Uznb8o5n/
[11:13:48] nice :)
[11:18:45] !log updating reprepro with new druid packages for buster-wikimedia to pick up new log4j jar files
[11:18:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:20:19] New packages ready for testing on an-test-druid1001.
[11:20:23] https://www.irccloud.com/pastebin/ZUmDtne0/
[11:20:33] !log btullis@an-test-druid1001:~$ sudo apt-get install druid-broker druid-common druid-coordinator druid-historical druid-middlemanager druid-overlord
[11:20:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:20:49] (03PS1) 10Joal: Add network_internal_flows to gobblin netflow job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/748099 (https://phabricator.wikimedia.org/T263277)
[11:22:48] `btullis@an-test-druid1001:~$ sudo systemctl status druid-*` shows that all of the services were restarted automatically during the package update, but that they all started successfully.
[11:23:06] 10Data-Engineering, 10Data-Engineering-Kanban, 10Infrastructure-Foundations, 10SRE, and 3 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10JAllemandou) Adding question here in addition to the CR: For druid ingestion we have 2 jobs, the first ingests all colu...
[11:24:19] (03PS4) 10Joal: exclude conflicting dependencies [analytics/gobblin-wmf] - 10https://gerrit.wikimedia.org/r/730282 (owner: 10ODimitrijevic)
[11:25:09] gehel: I updated --^ to make gobblin happy with it - review when you wish for a merge :)
[11:28:31] 10Data-Engineering: Send cassandra3 (new hosts) logs to logstash - https://phabricator.wikimedia.org/T297460 (10JAllemandou) > Is it this particular dashboard cassandra_aqs upon which you're expecting to see cassandra logs @JAllemandou ? Yes @BTullis :)
[11:30:36] 10Analytics-Clusters, 10Data-Engineering, 10Data-Engineering-Kanban, 10Cassandra, and 2 others: Switch over the Cassandra AQS cluster to the new hosts - https://phabricator.wikimedia.org/T297803 (10JAllemandou) \o/
[11:34:46] I'm happy that Druid works with the new packages, so I plan to do a rolling install of the new packages. Since the `apt-get install` will restart all of the services on a host, I'll do them one at a time and make sure that the cluster is fine before proceeding to the next. I'll start with the analytics cluster and then do the public cluster afterwards. Happy with this approach elukey? joal? Anything else I should be careful about?
[11:36:52] Good for me btullis - the restarts will lead to a full cache reload, but I don't see that as problematic
[11:39:52] +1 - I will watch the coordinator UI and wait until the segments are all available on one node before proceeding.
[11:41:01] btullis: remember that the public cluster is behind an LVS IP, so depool/pool are needed before/after (to avoid impacting traffic)
[11:41:17] the rest looks good, maybe upgrade one node and leave it running for a bit before the rest
[11:41:24] (like a couple of hours)
[11:41:37] OK, thanks. Will do.
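A sketch of the per-host sequence for the public druid cluster based on the plan above. The apt-get and systemctl lines are the ones already used on an-test-druid1001; the pool/depool helper scripts are assumed to be present on the LVS-backed druid hosts (the equivalent confctl commands from puppetmaster would also work), and the wait steps are manual.

```bash
# On each public druid host, one at a time (druid1004 first, per the plan):
sudo depool                       # public cluster sits behind LVS, so depool before touching it

# Installing the rebuilt packages restarts all druid services on the host
sudo apt-get install druid-broker druid-common druid-coordinator \
     druid-historical druid-middlemanager druid-overlord

sudo systemctl status 'druid-*'   # confirm every service came back up

# Manually: watch the coordinator UI until all segments are available again on
# this host, leave it running for a couple of hours, then repool and move on.
sudo pool
```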
[11:41:58] btullis / joal: sorry about the mess I left in ops duty yesterday. I'm cleaning everything
[11:42:19] there's a persistent error in refine due to the use of "keep" instead of "keep_all" on a struct, easy fix and then a deploy
[11:42:30] there's also a new wiki that I'll add to the pageview allow list
[11:42:44] no worries milimetric - let me know if I can help
[11:43:01] nono, easy fixes, just making sure you're not already bothered
[11:43:02] Cool, no worries. Thanks milimetric. Same here. Let me know if I can help with anything.
[11:47:34] 10Data-Engineering, 10Generated Data Platform: Set up regular-repairs for AQS cassandra cluster tables - https://phabricator.wikimedia.org/T297944 (10JAllemandou)
[11:48:20] milimetric: I was waiting for either you or Ben to pick up - I'd have talked about it later if no one had moved :)
[11:48:40] milimetric: I'm trying to be disciplined about not picking up on alerts too fast :)
[11:49:55] joal: I'm trying to be disciplined about picking up on alerts fast *enough* :-)
[11:50:12] thank you <3 :)
[11:50:32] joal: I woke up this morning, read my emails, panicked that I messed up your morning, ran over here to try to fix it
[11:50:47] so really, thank you so much for giving me the chance :)
[11:50:53] it makes me feel better
[11:51:49] milimetric: thanks a lot - please don't panic, and also be gentle on yourself and time your time for the fixes and all :)
[11:52:08] s/time your time/take your time/
[11:52:17] All segments fully available from all datasets and the historical service is reporting nothing to load on an-druid1001, so all looks good on that front. an-coord1002 is the new coordinator.
[11:53:15] I will leave this host for a while as elukey advised, but in parallel start work on the public druid cluster. I will depool druid1004 and then update the packages and pool it again if all is well.
[12:02:14] woah... there's like 10 new projects...?
[12:05:23] hm... ok, I could use a bit of advice. So the pageview allow list showed pnb.wikiquote.org, an.wikiquote.org, arz.wikiquote.org, azb.wikiquote.org, ce.wikiquote.org, fiu-vro.wikiquote.org, fy.wikiquote.org, ig.wikiquote.org, mwl.wikiquote.org, tl.wikiquote.org as exceptions
[12:05:32] these are all either nonexistent or in the incubator
[12:06:09] so there must be some bug in this pipeline, perhaps introduced recently?
[12:06:11] hm
[12:06:26] (not on our side, on how routing happens?)
[12:06:50] let's check for the pageviews in both pageview and webrequest for those projects to try to assess
[12:07:07] milimetric: wanna pair on that?
[12:07:14] i mean, it wouldn't trigger the alert unless there were requests, yeah sure joal, cave?
[12:07:23] OMW
[12:07:41] I'll join you, because this pipeline is new to me.
[12:07:51] come on in btullis
[12:08:32] eqi stat1008
[12:08:34] woops
[12:32:36] !log Upgraded druid packages, with pool/depool on druid1004
[12:32:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:45:19] (03PS1) 10Milimetric: Retain struct correctly with keep_all [analytics/refinery] - 10https://gerrit.wikimedia.org/r/748115
[12:45:38] joal: if you could sanity check that, that'd be helpful ^
[12:52:55] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Retain struct correctly with keep_all [analytics/refinery] - 10https://gerrit.wikimedia.org/r/748115 (owner: 10Milimetric)
[13:08:33] 10Data-Engineering, 10Data-Engineering-Kanban, 10Infrastructure-Foundations, 10SRE, and 3 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10BTullis) I believe that it is not necessary to refine this data.
[13:24:58] 10Data-Engineering, 10Data-Engineering-Kanban, 10Infrastructure-Foundations, 10SRE, and 3 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10ayounsi) In theory there should not be any PII data, but it would be safer to sanitize it nonetheless. As the data is m...
[13:30:43] (03CR) 10Gehel: [C: 03+1] "LGTM" [analytics/gobblin-wmf] - 10https://gerrit.wikimedia.org/r/730282 (owner: 10ODimitrijevic)
[14:03:35] milimetric: reviewing your patch now - there are no other keep_all entries in the entire file - feeling weird - will read the underlying code
[14:13:39] PROBLEM - Check unit status of refine_event_sanitized_analytics_immediate on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_event_sanitized_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[14:44:47] milimetric: I have read the underlying code and using keep_all should do (https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/SanitizeTransformation.scala#L637)
[14:45:28] joal: yeah, it looked like that to me too, but it's also used for keeping the whole dataframe, and it's *never* used anywhere else in the allow list for keeping a struct, so I was like... hm... I don't like being first
[14:45:53] it's deployed and seems to have worked for the latest run. I'm in the middle of trying to figure out how to properly rerun the failed hours
[14:46:01] yeah milimetric, same feeling, that's why I read the code :)
[14:46:09] ack milimetric - thanks a lot
[14:46:26] it's a major pain, in my opinion, to rerun refine sanitize, and I don't understand why we do it so differently from normal refine
[14:48:32] oh, btullis: have you rerun refine sanitize before? Let me know if you want to do it together
[14:49:41] I thought I had, but perhaps I haven't. Yes I'd appreciate looking at it together, if that's OK. bc?
[14:52:42] milimetric: ^
[14:52:53] omw cave btullis
[15:08:41] joal: oops: 21/12/17 15:08:07 ERROR Client: Application diagnostics message: User class threw exception: java.lang.IllegalArgumentException: 'keep_all' tag is not permitted in sanitization allowlist. (normalized_host: keep_all)
[15:09:17] we both got tricked, lemme know if you want to debug together, I'm not sure exactly what to do, maybe just turn off sanitize for this dataset for now?
[15:11:48] yep, lines ~ 400 in SanitizeTransformation.scala definitely don't allow keep_all, despite the fact that it's used later. So that seems like a bug one way or another. Just not sure if it was tried and we decided it didn't work or if we just forgot to remove the code that doesn't allow keep_all
[15:15:53] (03PS1) 10Milimetric: Remove mediawiki_skin_diff from allow list [analytics/refinery] - 10https://gerrit.wikimedia.org/r/748139
[15:16:03] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Remove mediawiki_skin_diff from allow list [analytics/refinery] - 10https://gerrit.wikimedia.org/r/748139 (owner: 10Milimetric)
[16:10:08] (03CR) 10ODimitrijevic: [C: 03+1] exclude conflicting dependencies [analytics/gobblin-wmf] - 10https://gerrit.wikimedia.org/r/730282 (owner: 10ODimitrijevic)
[16:30:46] milimetric: joal
[16:30:47] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Event_Sanitization#Allowlists
[16:30:53] keep_all is not allowed for analytics data
[16:30:58] each field has to be explicitly listed for keeping
[16:31:21] there are two allowlists for this very reason
[16:31:24] cc mforns ^
[16:32:04] see also static_data/sanitization/README.md
[16:32:04] https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/+/refs/heads/master/static_data/sanitization/
[16:38:59] hear hear
[16:39:00] Can I help with anything to do with this?
[16:39:21] also, can I help?
[16:41:42] milimetric: sorry for that issue, I merged it without noticing that normalized_host was a struct
[16:41:47] my bad
[16:42:11] we should specify all fields within it that we want to keep
[16:42:30] do you want to pair?
[16:44:27] mforns / ottomata: ok, but then this line should be removed or changed: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/SanitizeTransformation.scala#L640
[16:45:18] milimetric: this line is needed for the other sanitization allow list, where keep_all is allowed, no?
[16:45:19] mforns: ok, so I guess I can just explicitly "keep" all the inner fields. I'll do that, deploy, and rerun sanitize
[16:46:27] it's very confusing, it tricked both me and Joseph when we were trying to figure out what to do
[16:46:39] hm, I'm sorry
[16:47:24] no need to apologize, just talking about the code and how I see it, maybe I'm wrong
[16:48:55] the reasoning behind it is that listing all fields and subfields is annoying, thus we want to do it only when it's necessary: so we allow keep_all for the sanitization list that we control (production), but we don't allow keep_all for the analytics allowlist, because we don't control what composed fields contain.
[16:49:29] people could add new fields to streams, and they would be automatically kept indefinitely
[16:49:55] that part makes perfect sense, but there should be either a schema for the allow list checked at commit time, or something that would prevent these kinds of mistakes
[16:50:08] like a build / validate step
[16:50:26] it would easily pay for itself even with one mistake
[16:50:29] this makes a lot of sense, a syntax test at commit time
[16:51:57] (03PS1) 10Milimetric: Add mediawiki_skin_diff properly [analytics/refinery] - 10https://gerrit.wikimedia.org/r/748160
[16:52:06] mforns: if you don't mind making sure I didn't mess that up ^
[16:52:15] ofc
[16:54:22] (03CR) 10Mforns: [V: 03+2 C: 03+2] "LGTM!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/748160 (owner: 10Milimetric)
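Roughly, the "Add mediawiki_skin_diff properly" patch above lists each sub-field explicitly instead of tagging the whole struct. A sketch of the intended shape is below; the sub-field names come from the normalized_host struct quoted later in this log, while the top-level key and the exact layout of the allowlist files under static_data/sanitization/ are assumptions.

```bash
# Illustration only: print the shape of the per-sub-field entry as a heredoc.
# Field names come from the normalized_host struct; file layout is an assumption.
cat <<'YAML'
mediawiki_skin_diff:
    normalized_host:
        project_class: keep
        project: keep
        qualifiers: keep
        tld: keep
        project_family: keep
YAML
```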
[16:55:17] hm - so IIRC the difference between the two sanitization lists is that one accepts the keep_all tag and the other doesn't - This piece is the one milimetric and I were missing, but it wasn't readily apparent when we were either looking at the list itself or the code - I wonder how we could make this more visible
[16:58:01] joal: I think make it explicit in https://wikitech.wikimedia.org/wiki/Analytics/Systems/Event_Sanitization#Allowlists, and maybe add more explanations in the code comments, and I like milimetric's idea of CI testing the allow-list at commit time
[16:58:22] *I think we should make it explicit...
[17:00:16] (deploying new allowlist now)
[17:00:30] mforns: yeah, I know the wiki page tells it plain, but if you don't have the idea of looking at it before reviewing the sanitization, well too bad :) Maybe we could add a link to the doc page in the code, in the main header, with big 'IMPORTANT NOTE ABOUT KEEP_TAG' ?
[17:02:20] yes, also put it in the allow-list itself
[17:02:31] we could do that
[17:02:38] thinking..
[17:04:07] joal: another thing we could do is to change the error message in the code.
[17:04:13] Instead of "Invalid allowlist value keep for nested field `normalized_host` ..."
[17:04:22] It should say:
[17:04:37] wait no, that one is good
[17:05:02] YES! I follow you - Good call mforns - Having different messages for when there is a "keep_all" while it shouldn't be used
[17:05:11] otherright?
[17:05:13] yes, that's it
[17:05:23] yeah that makes sense
[17:05:31] And possibly some comments :)
[17:06:10] The code is complex, so adding multiple comments at different strategic places about that one is of importance IMO :)
[17:06:31] joal: hm, there's actually an error that indicates exactly that
[17:06:34] +1, we can delay the CI idea, but I think that would be fun to look more holistically at everything we do and see how we could CI it, to help both maintenance and onboarding and in general peace of mind
[17:06:50] RECOVERY - Check unit status of refine_event_sanitized_analytics_immediate on an-launcher1002 is OK: OK: Status of the systemd unit refine_event_sanitized_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:08:17] milimetric, joal: I'm looking for alerts from the updated list with keep_all, did we receive any?
[17:08:32] cannot find any
[17:08:53] I don't think so mforns - milimetric told me that his patch with 'keep_all' had run successfully
[17:09:22] no I was wrong
[17:09:31] ack milimetric
[17:09:37] It failed the whole job, not just that schema
[17:09:46] that's why I reverted
[17:10:08] Ah, I completely missed that, my bad milimetric
[17:10:17] I should have backscrolled
[17:10:48] ok then I think the error message is in place, we can then improve the code comments as joal suggests, also add some comments in the yaml lists.
[17:11:08] and we can create a task for CI testing those allow-lists
[17:11:43] doing this
[17:12:41] mforns wait, no, the error is there if you use keep_all, but the error if you use keep on a struct is still misleading, it should tell you to expand and use keep on all properties, that keep_all is not allowed here
[17:12:53] (03CR) 10Joal: [C: 03+2] "Merging!" [analytics/gobblin-wmf] - 10https://gerrit.wikimedia.org/r/730282 (owner: 10ODimitrijevic)
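As a rough illustration of the "syntax test at commit time" idea discussed above, a pre-merge check could be as small as rejecting keep_all in the analytics allowlist. The file name below is an assumption based on the static_data/sanitization/ link earlier; a real CI job would presumably also validate the YAML itself.

```bash
#!/bin/bash
# Fail the build if the analytics sanitization allowlist uses keep_all anywhere
# (file path is an assumption; a full check would also parse and validate the YAML).
allowlist="static_data/sanitization/event_sanitized_analytics_allowlist.yaml"
if grep -n 'keep_all' "$allowlist"; then
    echo "ERROR: keep_all is not permitted in ${allowlist}" >&2
    exit 1
fi
```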
[17:13:35] milimetric: you mean this one?: Original exception: java.lang.IllegalArgumentException: Invalid allowlist value keep for nested field `normalized_host` STRUCT<`project_class`: STRING, `project`: STRING, `qualifiers`: ARRAY, `tld`: STRING, `project_family`: STRING>
[17:14:33] yep
[17:14:49] we could add: "either mark all subfields with keep or use keep_all"
[17:15:13] no! That's how we got confused :)
[17:15:40] if anything "or use keep_all, but check whether or not it's allowed"
[17:15:46] and: "NOTE: some lists do not allow the use of keep_all tags"
[17:15:56] hm maybe: "You can't use keep_all with this sanitization-list" -- That way it's explicit that the problem is with that?
[17:16:21] joal: that can go in the yaml list code, yes, but not in the spark job
[17:16:30] since it's generic
[17:16:45] mforns: let's batcave for a minute?
[17:16:48] k
[17:18:37] (03Merged) 10jenkins-bot: exclude conflicting dependencies [analytics/gobblin-wmf] - 10https://gerrit.wikimedia.org/r/730282 (owner: 10ODimitrijevic)
[18:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[19:33:15] 10Data-Engineering, 10Generated Data Platform: Set up regular-repairs for AQS cassandra cluster tables - https://phabricator.wikimedia.org/T297944 (10Eevans) Repair is such a... complex subject. So much so that I'm not sure how to do justice to all of the considerations in a phab comment. :/ Let me try this f...
[21:56:03] 10Quarry, 10tool-sql-optimizer: Allow /query/new to accept sql parameter - https://phabricator.wikimedia.org/T196525 (10MusikAnimal)
[22:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org