[00:15:59] 10Analytics, 10Analytics-Jupyter, 10Data-Engineering, 10Product-Analytics: conda list does not show all packages in environment - https://phabricator.wikimedia.org/T294368 (10nshahquinn-wmf) a:05odimitrijevic→03None
[00:21:44] RECOVERY - Check unit status of monitor_refine_event_sanitized_analytics_delayed on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event_sanitized_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:42:37] (03PS1) 10Jdlrobson: Add `init` as a valid enum for action field [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/736933 (https://phabricator.wikimedia.org/T294738)
[00:43:21] (03CR) 10jerkins-bot: [V: 04-1] Add `init` as a valid enum for action field [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/736933 (https://phabricator.wikimedia.org/T294738) (owner: 10Jdlrobson)
[00:43:25] (03CR) 10Jdlrobson: "Hey Ottomata are changes like these to enum values valid without a change of schema or do they need to be versioned?" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/736933 (https://phabricator.wikimedia.org/T294738) (owner: 10Jdlrobson)
[00:46:40] (03PS2) 10Jdlrobson: Add `init` as a valid enum for action field [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/736933 (https://phabricator.wikimedia.org/T294738)
[01:53:56] 10Analytics, 10Product-Analytics: AQS `edited-pages/new` metric does not make clear that the value is net of deletions - https://phabricator.wikimedia.org/T240860 (10nshahquinn-wmf)
[04:27:36] PROBLEM - Check unit status of monitor_refine_event_sanitized_analytics_delayed on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event_sanitized_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:19:27] hello folks, going to try a second time to change tls certs in kafka test
[08:29:22] ack elukey
[08:34:49] joal: bonjour
[08:34:54] Bonjour elukey
[08:40:06] For whoever is interested: a paper review on an algorithm that helps scale resource management for big clusters - https://emptysqua.re/blog/parsync/
[09:46:27] still some troubles with the truststore, going afk for a bit, will check later
[10:22:36] joal: Morning. Following our discussion yesterday, I'm still not 100% clear as to why we only have 30 days' worth of webrequest data in Druid.
[10:23:09] This seems to say that we keep 60 days' worth in deep storage: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/manifests/analytics/refinery/job/data_purge.pp$128
[10:24:19] Do we purge /Historical/ in a different manner from /Deep storage/?
[10:30:53] btullis: those Refine sanitize monitor alerts are still me from yesterday, I wonder what we should do to trick it so it stops firing
[10:34:22] milimetric: Hmm. OK. Looking now. Catching up on your email from last night too.
[10:45:32] btullis: historical nodes pull data from deep storage, and they cache the segments on localhost, but we send purge commands to their API so they handle the data drop etc.
[10:47:26] milimetric: Is the data all refined now, as far as you are aware? I see that `monitor_refine_event_sanitized_analytics_delayed` last ran at 04:15 this morning and still gave an error, but I can't see any data that is not refined.
[10:48:03] milimetric: Is the data all refined now, as far as you are aware? Do you want to talk about it in the batcave?
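A side note on the purge mechanism elukey describes above (historicals caching segments locally, with drops driven through the coordinator): a minimal sketch of what such a purge call can look like, assuming the standard Druid coordinator endpoint for marking an interval's segments unused. The datasource name and interval here are purely illustrative, and the actual refinery deep-storage purge job may issue its drops differently.

    # Hypothetical example: mark one day of segments as unused via the coordinator API
    # (reachable e.g. through an ssh tunnel to the coordinator on port 8081).
    curl -X DELETE \
      "http://localhost:8081/druid/coordinator/v1/datasources/webrequest_sampled_128/intervals/2021-08-01_2021-08-02"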
[10:50:43] elukey: Thanks for that, but the question is: where does the 30-day retention come from?
[10:52:19] btullis: I am checking the coordinator console, I see loadByPeriod set to P30D, I don't recall the setting but it might be it
[10:52:40] ssh -L 8081:localhost:8081 an-druid1001.eqiad.wmnet
[10:52:56] datasources -> webrequest_sampled.. ->
[10:53:55] https://druid.apache.org/docs//0.20.0/operations/rule-configuration.html
[10:55:43] we have loadForever for most of the data sources, and a loadByPeriod for some
[10:57:19] in theory if we change it to loadForever and we remove the drop rule we should be good
[10:57:21] elukey: Excellent! Many thanks. So we keep 60 days' worth of `webrequest_sampled_128` in deep storage, so we /could/ load that much if needed for an ad-hoc investigation? But in general we only load the last 30 days' worth?
[10:59:12] btullis: I think that we keep only 30 days in deep storage as well, IIRC the custom load+drop config takes care of cache + deep-storage
[10:59:21] but joal is the expert
[11:00:08] > in theory if we change it to loadForever and we remove the drop rule we should be good
[11:00:08] OK, so this would add about another 625 GB of data to the historical nodes, but give an additional 30 days' worth of data in Turnilo?
[11:01:18] I am trying to confirm how many segments we have from the UI, but it doesn't seem to be working super well; it should be it though
[11:01:29] > the custom load+drop config takes care of cache + deep-storage
[11:01:29] Oh, right. So the 60-day purge job in puppet is redundant then, if Druid itself is purging deep storage?
[11:01:57] exactly, this is my understanding
[11:02:48] Cool. Just to be clear, I'm not requesting a change, I'm just following up on _joe_'s question from yesterday and trying to fill gaps in my understanding.
[11:05:28] sure sure
[11:06:43] we also replicate the segments 2 times for webrequest
[11:07:22] so yes let's wait for joal to confirm, and possibly advise about a new setting
[11:09:02] :+1
[11:09:29] There seem to be some older files and directories here:
[11:09:34] https://www.irccloud.com/pastebin/s25qVqY9/
[11:50:06] btullis: I'll ping you when the kids are at school in a couple hours, but I think there are two kinds of situations the monitor is alarming on. One is data that's not yet refined because this schema is new to the allow list and it's just not done yet.
[11:50:42] And two is data that's been refined but there's some weird problem with it where the monitor still alarms
[11:51:34] September 18 has a few hours like that, it's in the range I manually ran and the output is clearly there, but the alarm ran *after* my rerun and after I reset it
[12:01:39] milimetric: ack - Thanks.
[12:26:06] Hi btullis
[12:26:30] Hiya joal
[12:28:53] btullis: first, a clarification on the druid load rules - those rules apply to historical nodes loading segments from deep-storage, and don't impact segment retention in deep-storage
[12:29:14] btullis: This is why we need an explicit deep-storage deletion job in puppet
[12:30:44] btullis: therefore we can load older data in historical nodes and query them by changing the retention rule
[12:30:59] OK, got it, and this `30D+future` is why the historical daemons load 30 days from deep storage, right?
[12:31:02] https://usercontent.irccloud-cdn.com/file/bNp63t56/image.png
[12:31:09] correct
[12:34:37] btullis: trying to understand the context - is there a request from SRE for a one-off investigation?
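To make the loadByPeriod / loadForever discussion above concrete, here is a sketch of what the rule set described in this conversation (load 30 days plus future with 2 replicas, drop everything older) could look like when posted to the coordinator rules API. The JSON is an assumption reconstructed from the chat rather than copied from the cluster, and in practice this is edited in the coordinator console; bumping P30D to P45D or P60D would be the change elukey floats below.

    # Hypothetical reconstruction of the webrequest_sampled retention rules discussed above
    curl -X POST -H 'Content-Type: application/json' \
      "http://localhost:8081/druid/coordinator/v1/rules/webrequest_sampled_128" \
      -d '[
            {"type": "loadByPeriod", "period": "P30D", "includeFuture": true,
             "tieredReplicants": {"_default_tier": 2}},
            {"type": "dropForever"}
          ]'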
[12:39:14] It was just a query on #mediawiki_security yesterday as to why we only show 30 days in Turnilo. _joe_ thought it used to show more than that. I wanted to be able to answer more fully than I was able, but there is no request for any change.
[12:39:38] ack :)
[13:58:06] btullis: I was wrong, apologies :)
[13:58:30] we can add a note in puppet to reflect this, so we'll remember
[13:58:39] I was convinced that the rule also dropped from deep storage
[14:09:02] also, would it be possible to expand the load rule to say 45/60d for webrequest_sampled?
[14:09:11] it may be really useful for SREs
[14:09:16] (like Ben mentioned)
[14:30:03] (03PS1) 10GoranSMilovanovic: minor [analytics/wmde/WD/WikidataAnalytics] - 10https://gerrit.wikimedia.org/r/737049
[14:30:22] (03CR) 10GoranSMilovanovic: [V: 03+2 C: 03+2] minor [analytics/wmde/WD/WikidataAnalytics] - 10https://gerrit.wikimedia.org/r/737049 (owner: 10GoranSMilovanovic)
[14:52:52] (03PS1) 10GoranSMilovanovic: minor [analytics/wmde/WD/WikidataAnalytics] - 10https://gerrit.wikimedia.org/r/737053
[14:53:10] (03CR) 10GoranSMilovanovic: [V: 03+2 C: 03+2] minor [analytics/wmde/WD/WikidataAnalytics] - 10https://gerrit.wikimedia.org/r/737053 (owner: 10GoranSMilovanovic)
[15:30:26] mforns: hey are you around / want to debug refine monitor with us?
[15:42:49] 10Analytics, 10Analytics-Kanban, 10Airflow, 10Data-Engineering, 10Data-Engineering-Kanban: [Airflow] Create repository for Airflow DAGs - https://phabricator.wikimedia.org/T294026 (10mforns)
[15:48:37] btullis: I've had a mishap with some chocolate, so I'm gonna be gone until after lunch. I think basically we can let the alert fire over the weekend, I'll reply to the email explaining where we are, and pick up on Monday again. Maybe it's best to just let it catch up and then remove the _REFINED flags for all the hours that are still alerting
[15:49:03] we are moving to Airflow, but my guess is that whatever the problem is it would follow us there, the spark logic wouldn't change
[15:53:00] milimetric: Gotcha. Good luck with that chocolate.
[15:53:12] :)
[15:55:31] Agree that it would be nice to get to the root of the issue though. If it keeps alerting on Monday after the backfill has finished (as we expect), maybe we could add some debug information or increase the verbosity to try to give us something more to work on?
[16:30:18] 10Analytics, 10Event-Platform, 10Readers-Web-Backlog, 10Patch-For-Review: WikipediaPortal Event Platform Migration - https://phabricator.wikimedia.org/T282012 (10Jdlrobson) @ovasileva @Jdrewniak are we a blocker here?
[16:36:15] (03PS5) 10Bearloga: movement_metrics: Migrate pageviews tables and ETL [analytics/wmf-product/jobs] - 10https://gerrit.wikimedia.org/r/736583 (https://phabricator.wikimedia.org/T291956)
[16:37:08] dcausse: would you be around?
[16:37:50] (03CR) 10Bearloga: [V: 03+2 C: 03+2] "Verified manually with personal database" [analytics/wmf-product/jobs] - 10https://gerrit.wikimedia.org/r/736583 (https://phabricator.wikimedia.org/T291956) (owner: 10Bearloga)
[16:41:35] maybe ebernhardson?
[16:45:24] no big deal folks, it'll wait for next week :)
[17:10:49] milimetric: Heya just saw your message, I'm here if you want!
[17:11:19] hey mforns so we have a pretty good idea of _what_ is going on, but not _why_ :)
[17:11:31] are you guys in da cave?
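Picking up milimetric's 15:48 note above about removing the _REFINED done-flags for the hours that are still alerting once the backfill has caught up: a rough sketch of what that could look like, assuming the usual partitioned layout under the sanitized event tree. The table name and partition below are placeholders for illustration; the real paths would come from the hours listed in the alert emails.

    # Hypothetical example: drop the done-flag for one still-alerting hour so it gets
    # re-checked/re-refined (table and partition values here are made up).
    hdfs dfs -rm /wmf/data/event_sanitized/some_table/year=2021/month=9/day=18/hour=3/_REFINED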
[17:11:48] so a few of the hours that RefineSanitize processed (without error) are making RefineMonitor fire
[17:11:55] we're not, but I can jump in
[17:12:30] if you look at 2021-09-19T20, it's in one of the recent refine monitor sanitize delayed alarm emails
[17:12:42] aha
[17:12:43] and if you look at 2021-09-19T19, it's not in an alarm, but it looks exactly the same
[17:14:47] lookin
[17:16:37] joal: if you are still around https://gerrit.wikimedia.org/r/c/operations/puppet/+/737097/
[17:23:07] milimetric: I think it might have to do with: https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/analytics/refinery/job/refine.pp#L293
[17:23:36] IIUC, the monitor for the delayed job is executed at the same time as the refine job (for the 3rd hour of the day)
[17:23:45] 10Analytics-Radar, 10Data-Engineering, 10Event-Platform, 10SRE, 10Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (10elukey) Finally some progress! Some notes: * We have now a generic define to create .p12/.jks truststores cont...
[17:24:01] milimetric: the Refine job has no time to finish refinement before the monitor can check all outputs
[17:25:21] milimetric: for the immediate job it's different! The monitor executes with slightly different parameters (the defaults), which are now-26 to now-2 for the refine job, and now-28 to now-4 for the monitor. So there's a 2-hour grace period between refine and monitor there
[17:26:27] milimetric: I guess, if we add a couple more hours to the monitor interval for the delayed job, the false alerts will stop
[17:26:57] hm, but that monitor ran way after I finished refining those hours, so the data was there
[17:27:18] unless it somehow started before and just took a long time...
[17:28:05] were you doing backfilling?
[17:28:59] yeah, I'm still backfilling, basically all the data that's available in /event/ but older than 45 days will throw alarms unless we backfill
[17:29:33] since _delayed was going one day at a time, every day it was running was throwing alarms
[17:30:40] aha, understand..
[17:30:42] I'm executing RefineSanitizeMonitor manually now for one of the time ranges
[17:30:45] to check
[17:30:49] ok
[17:32:04] INFO RefineMonitor: No targets need or have failed refinement in database Some(event) -> database event_sanitized (/wmf/data/event_sanitized)
[17:32:56] ok, so btullis you were right, I was wrong, we should've just rerun it. Somehow, even though it looked like it ran a few hours after RefineSanitize, the monitor was looking at old data. It works fine if there's data there.
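To illustrate mforns' timing explanation above: a small sketch of the two windows for the immediate job, using plain GNU date. The hour offsets (26/2 for refine, 28/4 for the monitor) are taken from the chat; how the puppet profile actually formats and passes them is not shown here.

    # Refine window vs. monitor window for the immediate job (offsets from the discussion above).
    # The monitor's window ends two hours earlier than the refine window, which is
    # what gives refine time to finish before the monitor checks those hours.
    date -u -d '26 hours ago' '+%Y-%m-%dT%H:00:00'   # refine window start
    date -u -d '2 hours ago'  '+%Y-%m-%dT%H:00:00'   # refine window end
    date -u -d '28 hours ago' '+%Y-%m-%dT%H:00:00'   # monitor window start
    date -u -d '4 hours ago'  '+%Y-%m-%dT%H:00:00'   # monitor window end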
[17:33:11] ok, so then I'll just clear the alarm and we shouldn't have any more problems
[17:33:59] k, reset
[17:37:45] have a good day/weekend folks
[17:43:08] RECOVERY - Check unit status of monitor_refine_event_sanitized_analytics_delayed on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event_sanitized_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:44:56] 10Analytics, 10Event-Platform, 10Patch-For-Review, 10Readers-Web-Backlog (Needs Prioritization (Tech)): WikipediaPortal Event Platform Migration - https://phabricator.wikimedia.org/T282012 (10Jdlrobson)
[17:47:58] milimetric: I will push a patch to (hopefully) avoid getting alarms for newly added datasets in the future
[17:55:41] milimetric: https://gerrit.wikimedia.org/r/c/operations/puppet/+/737110
[17:56:45] Thanks for investigating milimetric and mforns :)
[17:57:48] that makes sense, mforns, maybe it just looked like it ran after because it was taking a while to report. +1 from me
[18:53:03] (03PS1) 10MewOphaswongse: Add a link: add expand and collapse actions [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/737114 (https://phabricator.wikimedia.org/T293147)
[21:34:30] (03PS1) 10MewOphaswongse: Add an image: update schema for editsummary_dialog [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/737135 (https://phabricator.wikimedia.org/T294672)
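For completeness on the "clear the alarm" / "k, reset" exchange above and the RECOVERY notification that follows: on these hosts that step is usually the failed-unit reset described on the linked Managing systemd timers page. A minimal sketch, assuming the stock systemctl workflow on an-launcher1002; the exact unit name and suffix are inferred from the alert text and are an assumption.

    # Clear the failed state of the monitor unit so the check can recover, then confirm.
    sudo systemctl reset-failed monitor_refine_event_sanitized_analytics_delayed.service
    sudo systemctl status monitor_refine_event_sanitized_analytics_delayed.service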