[00:04:29] RECOVERY - Check unit status of drop_event on an-launcher1002 is OK: OK: Status of the systemd unit drop_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:15:53] RECOVERY - Check unit status of monitor_refine_event_sanitized_main_immediate on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event_sanitized_main_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:16:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:41:40] (PS10) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[06:47:22] (CR) CI reject: [V: -1] Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566 (owner: Nmaphophe)
[06:49:26] (PS11) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[06:53:22] (CR) CI reject: [V: -1] Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566 (owner: Nmaphophe)
[07:12:16] (PS12) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[07:16:56] (CR) CI reject: [V: -1] Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566 (owner: Nmaphophe)
[07:22:10] (PS13) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[07:26:54] (CR) CI reject: [V: -1] Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566 (owner: Nmaphophe)
[07:55:10] (PS14) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[08:01:45] (CR) CI reject: [V: -1] Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566 (owner: Nmaphophe)
[08:25:34] !log Restart mediawiki_history_denormalize job manually
[08:25:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:26:03] Data-Engineering, Data-Engineering-Operations, SRE, SRE-Access-Requests: Access request to analytics system(s) for TThoabala - https://phabricator.wikimedia.org/T315409 (Jelto) Open→Stalled a: Ladsgroup→TThoabala
[08:26:22] (PS15) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[08:34:14] Data-Engineering, SRE, SRE-Access-Requests, Discovery-Search (Current work): Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (pfischer)
[08:35:16] Data-Engineering, SRE, SRE-Access-Requests, Discovery-Search (Current work): Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (pfischer) Added missing information.
[08:47:05] (PS16) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[08:51:53] (PS17) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[08:54:46] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ayounsi) Resolved→Open FYI, Netbox is alerting with: `an-presto1009 (WMF11494) Device is in PuppetDB but is Planned in Netbox (should...
[09:10:37] Data-Engineering, SRE, SRE-Access-Requests, Discovery-Search (Current work): Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (Gehel) Re-added SRE as this is ready to move forward.
[09:14:07] Data-Engineering, SRE, SRE-Access-Requests, Discovery-Search (Current work): Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (Ladsgroup) \o/ Welcome Peter!
[09:15:46] (PS18) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[09:46:46] Data-Engineering, SRE, SRE-Access-Requests, Discovery-Search (Current work): Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (Jelto) p: Triage→Medium a: Jelto
[09:59:21] (CR) Kosta Harlan: [C: +2] analytics/legacy/helppanel: Add new actions related to mid-edit signup [schemas/event/secondary] - https://gerrit.wikimedia.org/r/828676 (https://phabricator.wikimedia.org/T310320) (owner: Gergő Tisza)
[10:00:06] (Merged) jenkins-bot: analytics/legacy/helppanel: Add new actions related to mid-edit signup [schemas/event/secondary] - https://gerrit.wikimedia.org/r/828676 (https://phabricator.wikimedia.org/T310320) (owner: Gergő Tisza)
[10:12:29] Data-Engineering, SRE, SRE-Access-Requests, Discovery-Search (Current work), Patch-For-Review: Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (Jelto) Welcome @pfischer! Thanks for the request and all the approvals. We are missing one last approval from @thcipri...
[10:20:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:21:48] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:31:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:33:16] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:33:31] Data-Engineering, SRE, SRE-Access-Requests, Discovery-Search (Current work), Patch-For-Review: Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (Jelto) SSH key verified by Meet session and Gerrit +1 in https://gerrit.wikimedia.org/r/829148
[11:16:36] Data-Engineering, CheckUser, MW-1.38-notes (1.38.0-wmf.26; 2022-03-14), MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), and 3 others: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 (Zabe)
[12:00:29] (PS19) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[12:21:54] (CR) Hashar: [C: -1] "Vacations are long finished. I have gone on polishing up the Gerrit extension, the first change being at https://gerrit.wikimedia.org/r/c/" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811302 (https://phabricator.wikimedia.org/T311615) (owner: Hashar)
[12:48:38] (PS20) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[13:02:58] joal: I see that denormalized failed, with the same "Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: A shuffle map stage with indeterminate output was failed and retried. However, Spark cannot rollback the ShuffleMapStage 953 to re-process the input data, and has to fail this job. Please eliminate the indeterminacy by checkpointing the RDD before repartition and try again." as before.
[13:03:06] yup milimetric
[13:03:10] shall I increase the resources again?
[13:03:18] I'm trying to rerun it with different resources
[13:03:28] oh! but it's my ops week :P
[13:03:59] * milimetric looks at houses in Tokyo so he can wake up earlier than jo
[13:04:04] milimetric: I'll also send a CR adding a checkpoint - it'll probably make the job longer, but it'll in any case be shorter than a rerun
[13:04:31] we did gain a couple of days with the sqoop reorder, so it'll be plenty fast
[13:04:45] k, I'll review
[13:04:55] milimetric: I knew it was your ops week, but mine just finished and I thought it would be good to retry
[13:05:11] only concern with the patch is that it'll mean rerunning the thing with Spark 3
[13:05:28] which we should move to in any case, so possibly let's prioritize that
[13:06:42] I can help, but not too much today, I have to focus on the presentation
[13:07:37] no emergency milimetric - we'll have the thing working with a manual run, then I'll provide a patch, test with Spark 3, and finally I'll send an airflow job - all good :)
[13:07:59] ok, I can definitely help review after today
[13:08:30] milimetric: while I'm with you - I've added quite some things to next week's deploy - let's sync on that as you wish
[13:09:48] definitely, I'll ask when I get to it
[13:12:35] (PS21) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[13:17:04] (PS22) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[13:18:56] milimetric: I'm babysitting the job, and another thing we should do is find a way to un-skew non-regular joins (more explanations on demand) - because problems occur only when doing those weird temporal joins, where data is grouped by userId and then pageId
[13:19:15] some userIds and some pageIds have so much data that they skew the distribution a lot
[13:31:04] (PS23) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[13:35:55] I wonder if we should look at the whole algorithm again once we upgrade to Spark 3. There are a bunch of things and I wonder if they're related: incremental updates, storing the output in Iceberg, supposedly better skewed-join handling in Spark 3, etc.
[14:08:03] PROBLEM - SSH on stat1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:18:38] (CR) Gehel: "This looks great! The ArrayUDFAggregation class and use of inheritance makes this a lot nicer!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566 (owner: Nmaphophe)
[14:49:29] RECOVERY - SSH on stat1006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:24:07] milimetric: job succeeded! \o/
[16:33:19] fabulous, well done on the new params then
[18:01:30] heya teammm, do you know what happened with stat1006 earlier on today?
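[Editor's note] The checkpoint fix joal mentions follows directly from the Spark error above: checkpointing materializes the data and truncates the lineage, so a retried shuffle map stage re-reads deterministic checkpointed blocks instead of recomputing a non-deterministic upstream stage. A minimal Scala sketch, assuming a Spark session and HDFS paths that are placeholders, not the actual refinery patch or job code:

```scala
// Sketch: eliminate the "indeterminate output" failure by checkpointing
// before the repartition, as the SparkException message suggests.
// Paths and the app name are illustrative placeholders.
import org.apache.spark.sql.SparkSession

object CheckpointBeforeRepartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("checkpoint-sketch").getOrCreate()

    // Checkpoint data must go to a reliable filesystem (HDFS in production).
    spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoint-sketch")

    val df = spark.read.parquet("hdfs:///path/to/input") // placeholder input

    // checkpoint() writes the dataset out and cuts the RDD lineage, so any
    // stage retry starts from stable, deterministic data.
    val stable = df.checkpoint()

    stable.repartition(1024) // now safe to retry: input is no longer indeterminate
      .write.parquet("hdfs:///path/to/output") // placeholder output
  }
}
```

The trade-off joal notes (the job gets longer) is the extra write/read of the checkpointed dataset; the win is that a stage retry no longer has to fail the whole job.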
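[Editor's note] For the skewed temporal joins (a few hot userIds/pageIds carrying most of the rows), one standard technique the chat does not name explicitly is key salting. A hedged Scala sketch, with table names and columns invented for illustration (not the mediawiki_history_denormalize code):

```scala
// Sketch: salting a join key to spread a hot user_id across many partitions.
// NumSalts, paths, and column names are assumptions for illustration.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SaltedJoinSketch {
  val NumSalts = 32

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("salted-join-sketch").getOrCreate()

    val events = spark.read.parquet("hdfs:///path/to/events") // big, skewed on user_id
    val users  = spark.read.parquet("hdfs:///path/to/users")  // small, one row per user_id

    // Scatter the big side: each row gets a random salt in [0, NumSalts).
    val saltedEvents = events.withColumn("salt", (rand() * NumSalts).cast("int"))

    // Replicate the small side once per salt value so every
    // (user_id, salt) pair on the big side still finds its match.
    val saltedUsers = users.withColumn(
      "salt", explode(array((0 until NumSalts).map(lit): _*)))

    saltedEvents.join(saltedUsers, Seq("user_id", "salt"))
      .drop("salt")
      .write.parquet("hdfs:///path/to/output")
  }
}
```

This trades a NumSalts-fold replication of the small side for even partition sizes on the big side; Spark 3's adaptive query execution can do a similar split automatically for joins, which matches the "better skewed join handling in Spark 3" point above.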
[19:38:11] Quarry: build container on PR - https://phabricator.wikimedia.org/T316958 (rook)
[20:12:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp4030 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4030%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[20:14:18] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[20:17:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp4030 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4030%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[20:24:18] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[20:34:18] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for mediawiki_history_reduced_2022_08 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[20:54:18] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for mediawiki_history_reduced_2022_08 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[20:57:32] Quarry: test irc integration - https://phabricator.wikimedia.org/T316961 (rook)