[00:04:29] RECOVERY - Check unit status of drop_event on an-launcher1002 is OK: OK: Status of the systemd unit drop_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:15:53] RECOVERY - Check unit status of monitor_refine_event_sanitized_main_immediate on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event_sanitized_main_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:16:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:41:40] (PS10) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[06:47:22] (CR) CI reject: [V: -1] Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566 (owner: Nmaphophe)
[06:49:26] (PS11) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[06:53:22] (CR) CI reject: [V: -1] Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566 (owner: Nmaphophe)
[07:12:16] (PS12) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[07:16:56] (CR) CI reject: [V: -1] Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566 (owner: Nmaphophe)
[07:22:10] (PS13) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[07:26:54] (CR) CI reject: [V: -1] Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566 (owner: Nmaphophe)
[07:55:10] (PS14) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[08:01:45] (CR) CI reject: [V: -1] Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566 (owner: Nmaphophe)
[08:25:34] !log Restart mediawiki_history_denormalize job manually
[08:25:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:26:03] Data-Engineering, Data-Engineering-Operations, SRE, SRE-Access-Requests: Access request to analytics system(s) for TThoabala - https://phabricator.wikimedia.org/T315409 (Jelto) Open→Stalled a: Ladsgroup→TThoabala
[08:26:22] (PS15) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[08:34:14] Data-Engineering, SRE, SRE-Access-Requests, Discovery-Search (Current work): Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (pfischer)
[08:35:16] Data-Engineering, SRE, SRE-Access-Requests, Discovery-Search (Current work): Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (pfischer) Added missing information.
[08:47:05] (PS16) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[08:51:53] (PS17) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[08:54:46] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ayounsi) Resolved→Open FYI, Netbox is alerting with: `an-presto1009 (WMF11494) Device is in PuppetDB but is Planned in Netbox (should...
[09:10:37] Data-Engineering, SRE, SRE-Access-Requests, Discovery-Search (Current work): Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (Gehel) Re-added SRE as this is ready to move forward.
[09:14:07] Data-Engineering, SRE, SRE-Access-Requests, Discovery-Search (Current work): Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (Ladsgroup) \o/ Welcome Peter!
[09:15:46] (PS18) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[09:46:46] Data-Engineering, SRE, SRE-Access-Requests, Discovery-Search (Current work): Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (Jelto) p: Triage→Medium a: Jelto
[09:59:21] (CR) Kosta Harlan: [C: +2] analytics/legacy/helppanel: Add new actions related to mid-edit signup [schemas/event/secondary] - https://gerrit.wikimedia.org/r/828676 (https://phabricator.wikimedia.org/T310320) (owner: Gergő Tisza)
[10:00:06] (Merged) jenkins-bot: analytics/legacy/helppanel: Add new actions related to mid-edit signup [schemas/event/secondary] - https://gerrit.wikimedia.org/r/828676 (https://phabricator.wikimedia.org/T310320) (owner: Gergő Tisza)
[10:12:29] Data-Engineering, SRE, SRE-Access-Requests, Discovery-Search (Current work), Patch-For-Review: Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (Jelto) Welcome @pfischer! Thanks for the request and all the approvals. We are missing one last approval from @thcipri...
[10:20:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:21:48] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:31:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:33:16] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:33:31] Data-Engineering, SRE, SRE-Access-Requests, Discovery-Search (Current work), Patch-For-Review: Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (Jelto) SSH key verified by Meet session and Gerrit +1 in https://gerrit.wikimedia.org/r/829148
[11:16:36] Data-Engineering, CheckUser, MW-1.38-notes (1.38.0-wmf.26; 2022-03-14), MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), and 3 others: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 (Zabe)
[12:00:29] (PS19) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[12:21:54] (CR) Hashar: [C: -1] "Vacations are long finished. I have gone on polishing up the Gerrit extension, the first change being at https://gerrit.wikimedia.org/r/c/" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811302 (https://phabricator.wikimedia.org/T311615) (owner: Hashar)
[12:48:38] (PS20) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[13:02:58] joal: I see that denormalized failed, with the same "Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: A shuffle map stage with indeterminate output was failed and retried. However, Spark cannot rollback the ShuffleMapStage 953 to re-process the input data, and has to fail this job. Please eliminate the indeterminacy by checkpointing the RDD before repartition and try again." as before.
[13:03:06] yup milimetric
[13:03:10] shall I increase the resources again?
[13:03:18] I'm trying to rerun it with different resources
[13:03:28] oh! but it's my ops week :P
[13:03:59] * milimetric looks at houses in Tokyo so he can wake up earlier than jo
[13:04:04] milimetric: I'll also send a CR adding a checkpoint - it'll probably make the job longer, but it'll in any case be shorter than a rerun
[13:04:31] we did gain a couple of days with the sqoop reorder, so it'll be plenty fast
[13:04:45] k, I'll review
[13:04:55] milimetric: I knew it was your ops week, but mine just finished and I thought it would be good to retry
[13:05:11] only concern with the patch is that it'll mean rerunning the thing with Spark 3
[13:05:28] which we should move to in any case, so possibly let's prioritize that
[13:06:42] I can help, but not too much today, I have to focus on the presentation
[13:07:37] no emergency milimetric - we'll have the thing working with a manual run, then I'll provide a patch, test with Spark 3, and finally I'll send an airflow job - all good :)
[13:07:59] ok, I can definitely help review after today
[13:08:30] milimetric: while I'm with you - I've added quite some things to next week's deploy - let's sync on that as you wish
[13:09:48] definitely, I'll ask when I get to it
[13:12:35] (PS21) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[13:17:04] (PS22) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[13:18:56] milimetric: I'm babysitting the job, and another thing we should do is find a way to un-skew non-regular joins (more explanations on demand) - because problems occur only when doing those weird temporal joins, where data is grouped by userId and then pageId
[13:19:15] some userIds and some pageIds have so much data that they skew the distribution a lot
[13:31:04] (PS23) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[13:35:55] I wonder if we should look at the whole algorithm again once we upgrade to Spark 3. There are a bunch of things and I wonder if they're related: incremental updates, storing the output in Iceberg, supposedly better skewed-join handling in Spark 3, etc.
[14:08:03] PROBLEM - SSH on stat1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:18:38] (CR) Gehel: "This looks great! The ArrayUDFAggregation class and use of inheritance makes this a lot nicer!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566 (owner: Nmaphophe)
[14:49:29] RECOVERY - SSH on stat1006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:24:07] milimetric: job succeeded! \o/
[16:33:19] fabulous, well done on the new params then
[18:01:30] heya teammm, do you know what happened with stat1006 earlier on today?
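[Editor's note] The checkpoint fix joal mentions follows directly from the Spark error above: checkpointing materializes the data and truncates the lineage, so a retried shuffle map stage re-reads deterministic checkpointed blocks instead of recomputing a non-deterministic upstream stage. A minimal Scala sketch, assuming a Spark session and HDFS paths that are placeholders, not the actual refinery patch or job code:

```scala
// Sketch: eliminate the "indeterminate output" failure by checkpointing
// before the repartition, as the SparkException message suggests.
// Paths and the app name are illustrative placeholders.
import org.apache.spark.sql.SparkSession

object CheckpointBeforeRepartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("checkpoint-sketch").getOrCreate()

    // Checkpoint data must go to a reliable filesystem (HDFS in production).
    spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoint-sketch")

    val df = spark.read.parquet("hdfs:///path/to/input") // placeholder input

    // checkpoint() writes the dataset out and cuts the RDD lineage, so any
    // stage retry starts from stable, deterministic data.
    val stable = df.checkpoint()

    stable.repartition(1024) // now safe to retry: input is no longer indeterminate
      .write.parquet("hdfs:///path/to/output") // placeholder output
  }
}
```

The trade-off joal notes (the job gets longer) is the extra write/read of the checkpointed dataset; the win is that a stage retry no longer has to fail the whole job.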
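[Editor's note] For the skewed temporal joins (a few hot userIds/pageIds carrying most of the rows), one standard technique the chat does not name explicitly is key salting. A hedged Scala sketch, with table names and columns invented for illustration (not the mediawiki_history_denormalize code):

```scala
// Sketch: salting a join key to spread a hot user_id across many partitions.
// NumSalts, paths, and column names are assumptions for illustration.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SaltedJoinSketch {
  val NumSalts = 32

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("salted-join-sketch").getOrCreate()

    val events = spark.read.parquet("hdfs:///path/to/events") // big, skewed on user_id
    val users  = spark.read.parquet("hdfs:///path/to/users")  // small, one row per user_id

    // Scatter the big side: each row gets a random salt in [0, NumSalts).
    val saltedEvents = events.withColumn("salt", (rand() * NumSalts).cast("int"))

    // Replicate the small side once per salt value so every
    // (user_id, salt) pair on the big side still finds its match.
    val saltedUsers = users.withColumn(
      "salt", explode(array((0 until NumSalts).map(lit): _*)))

    saltedEvents.join(saltedUsers, Seq("user_id", "salt"))
      .drop("salt")
      .write.parquet("hdfs:///path/to/output")
  }
}
```

This trades a NumSalts-fold replication of the small side for even partition sizes on the big side; Spark 3's adaptive query execution can do a similar split automatically for joins, which matches the "better skewed join handling in Spark 3" point above.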
[19:38:11] Quarry: build container on PR - https://phabricator.wikimedia.org/T316958 (rook)
[20:12:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp4030 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4030%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[20:14:18] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[20:17:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp4030 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4030%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[20:24:18] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[20:34:18] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for mediawiki_history_reduced_2022_08 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[20:54:18] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for mediawiki_history_reduced_2022_08 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[20:57:32] Quarry: test irc integration - https://phabricator.wikimedia.org/T316961 (rook)