[06:53:18] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[06:59:18] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for mediawiki_history_reduced_2022_05 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[07:03:18] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[07:19:18] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for mediawiki_history_reduced_2022_05 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[08:11:02] (CR) DCausse: Add Schema for Enriched MW Streams (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/799351 (https://phabricator.wikimedia.org/T308017) (owner: Luke Bowmaker)
[08:20:26] Hi! Here is a patch to fill the missing airflow scheduler pid file: https://gerrit.wikimedia.org/r/c/operations/puppet/+/803396 could someone review it?
[09:04:10] (PS1) Lucas Werkmeister (WMDE): Sum site_stats rows [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/803458 (https://phabricator.wikimedia.org/T310043)
[09:05:22] aqu: I'm happy with that patch. Would you like me to merge and deploy?
[09:13:22] Hmm. Why doesn't that change have any effect on an-launcher1002 I wonder? https://puppet-compiler.wmflabs.org/pcc-worker1003/35759/
[09:51:00] I have replied on the CR
[09:58:17] Data-Engineering, Equity-Landscape: Transformations Flowchart - https://phabricator.wikimedia.org/T306614 (KCVelaga_WMF) For Readership domain: https://lucid.app/lucidchart/7337fa91-3339-41bc-8371-93ec0fb3e779/edit?invitationId=inv_11c464fe-134f-4b49-946c-f9ed14a998bc For Editorship domain: https://lucid...
[10:13:15] PROBLEM - Check unit status of eventlogging_to_druid_netflow-sanitization_daily on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow-sanitization_daily https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:14:41] Aah, this Druid alert is probably caused by me --^ I'm running the `sre.druid.roll-restart-workers` cookbook to pick up a new JVM. I will look at fixing this.
[10:17:31] btullis: we have a new mw history snapshot that we should load up in druid soon. I'll send the patch
[10:18:28] hey, just a heads-up as I'm not certain you are all getting the same messages - I'm getting warnings about stretch VMs being shut down by 30th of June in the Analytics project in WMCS
[10:18:46] probably some of the aqs instances that were recreated at some point
[10:20:02] hnowlan: Thanks, that's a good shout. I'll see if we can terminate them, or rebuild them if needed.
[10:20:33] milimetric: Thanks, I will look out for it and merge it asap.
[10:20:45] https://gerrit.wikimedia.org/r/c/operations/puppet/+/803472
[10:21:07] it's this procedure though, sadly not just merging: https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS#Deploy_new_History_snapshot_for_Wikistats_Backend
[10:21:48] I mention it because it restarts the nodes so it works with what you were saying above about the alerts
[10:24:04] milimetric: Thanks, that's OK. I'll roll-restart the aqs after merging. The alerts were caused by my restarting Druid, so a little downstream from aqs the Node app, but thanks for the heads-up anyway.
[10:24:46] oh right, sorry, I forgot we split up the druid/aqs nodes a long time ago... morning brain :)
[10:25:00] my biggest problem is forgetting stuff
[10:27:01] 🙂 I also have to do a rolling restart of the cassandra instances behind aqs, because they need the new Java runtime too.
[10:31:47] milimetric: Are you able to merge this for me please, if possible? My +2 rights on the repo haven't appeared so I've asked in #wikimedia-releng about it. https://gerrit.wikimedia.org/r/c/eventgate-wikimedia/+/803471
[10:42:46] milimetric: The new snapshot has loaded and it's visible in Wikistats. I'll drop a note in the PA Slack channel so they can re-run their jobs if they need. Anyone else we need to tell?
[10:45:40] I'll reply to my initial email thread. We're late but not that late, should be ok
[10:46:47] !log Delete old dag-runs for interlanguage-daily (before 2022-06-01)
[10:47:52] Hi milimetric - let me know if/when you wish to talk about the mediawiki-history failure
[10:49:10] joal: it worked with 32+8 G memory for the workers (instead of 24+6). That's about 1/3 of the cluster so I'm not sure what we want to do
[10:50:20] I can talk more in a few hours after I drop off the little monsters
[10:50:47] let's talk later - take your morning gently :)
[10:50:50] milimetric: --^
[10:51:09] !log restart the eventlogging_to_druid_netflow-sanitization_daily service on an-launcher1002
[10:51:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:54:47] RECOVERY - Check unit status of eventlogging_to_druid_netflow-sanitization_daily on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow-sanitization_daily https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:12:04] Data-Engineering, Event-Platform, Observability-Alerting, Patch-For-Review: Apparent latency warning in 90th centile of eventgate-logging-external - https://phabricator.wikimedia.org/T294911 (BTullis) With regard to this //apparent latency//, @akosiaris has identified the cause and provided the f...
[11:33:29] !log deployed an updated version of eventgate to eventgate-analytics-external to address the timing mis-calculation.
[11:33:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:42:11] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:53:21] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:00:50] Data-Engineering, Event-Platform, Observability-Alerting, Patch-For-Review: Apparent latency warning in 90th centile of eventgate-logging-external - https://phabricator.wikimedia.org/T294911 (BTullis) The update to service-runner worked as expected for eventgate-analytics-external. [[https://graf...
[13:30:04] (CR) Aqu: [C: +2] "Looks good ✔" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/802745 (owner: Joal)
[13:41:30] (Merged) jenkins-bot: Update SparkSQLNoCLIDriver to error correctly [analytics/refinery/source] - https://gerrit.wikimedia.org/r/802745 (owner: Joal)
[13:45:55] !log deploying updated eventgate images to all remaining deployments.
[13:45:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:02:24] joal: ok, back :)
[14:02:39] milimetric: my turn to leave for kids :S
[14:02:43] After standup?
[14:02:47] sure, no rush
[14:02:59] thanks a lot for fixing milimetric <3
[14:18:27] milimetric: Hi, I will tweak the options for Clickstream to avoid Skein log collection on this job. This is a known limitation.
[14:27:57] milimetric: I just realized a thing that might prevent us from using the Airflow data ingestion code as we discussed...
[14:28:25] if you have time, we can batcave
[14:35:12] milimetric: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/75
[14:35:50] mforns: omw cave
[14:36:41] milimetric: ok!
[14:37:04] +2 aqu, it'll merge soon
[14:45:24] pushed mforns
[14:58:41] Data-Engineering, Equity-Landscape: Affiliates input metric - https://phabricator.wikimedia.org/T309275 (ntsako) lua data loaded on ` SELECT * FROM ntsako.organizational_info WHERE year = 2021; `
[15:00:10] Data-Engineering, Equity-Landscape: Affiliates input metric - https://phabricator.wikimedia.org/T309275 (ntsako) Affiliate leadership data loaded on ` SELECT * FROM ntsako.affiliate_leadership WHERE year = 2021 `
[15:17:13] does anyone know what this SPDX licence header thing is all about?
[15:17:14] https://www.irccloud.com/pastebin/9K3nE8Cm/
[15:23:34] a-team: btullis was interested in looking at the clickhouse thing I was playing with during the hackathon, so I'm going to play with it for a few minutes before standup. Join the batcave if you're interested.
[15:37:15] Data-Engineering, Data-Engineering-Kanban, Generated Data Platform: [Shared Event Platform] - Research Flink Changelog semantics to inform POC MW schema design - https://phabricator.wikimedia.org/T310082 (Ottomata)
[15:48:41] there's nobody there :) moving to post-standup-post-retro
[15:58:53] milimetric: one more comment to the MR... sorry for being so consistently annoying
[15:59:06] Data-Engineering, Data-Engineering-Kanban, Generated Data Platform: [Shared Event Platform] - Research Flink Changelog semantics to inform POC MW schema design - https://phabricator.wikimedia.org/T310082 (Ottomata) Specifically, we should see if we can make our state change streams work with the [[ h...
[15:59:15] mforns: sorry for not getting it right so consistently :)
[15:59:52] xD no, no, I'm completely confused what the right thing to do is here..
[16:03:35] (CR) Milimetric: "note: we may just move this to Airflow" [analytics/refinery] - https://gerrit.wikimedia.org/r/803551 (https://phabricator.wikimedia.org/T309987) (owner: Milimetric)
[16:06:55] Data-Engineering, Data-Engineering-Kanban, Event-Platform, Generated Data Platform: [Shared Event Platform] Ability to use Event Platform streams in Flink without boilerplate - https://phabricator.wikimedia.org/T308356 (Ottomata)
[16:07:21] Data-Engineering, Data-Engineering-Kanban, Event-Platform, Generated Data Platform: [Shared Event Platform] Ability to use Event Platform streams in Flink without boilerplate - https://phabricator.wikimedia.org/T308356 (Ottomata) @lbowmaker I'm okay with calling this task done for now. There wil...
[16:09:47] Data-Engineering, Data-Engineering-Kanban, Airflow: Low Risk Oozie Migration: 4 wikidata metrics jobs - https://phabricator.wikimedia.org/T300021 (Ottomata)
[16:19:27] (CR) Joal: [C: +1] "Values work for me :) Let's decide if we wish to move this to airflow" [analytics/refinery] - https://gerrit.wikimedia.org/r/803551 (https://phabricator.wikimedia.org/T309987) (owner: Milimetric)
[16:26:46] (PS1) Joal: Update hql wikidata_specialentity file [analytics/refinery] - https://gerrit.wikimedia.org/r/803563
[16:27:34] mforns, SandraEbele - I have sent this code review following the one you merged --^
[16:27:47] 👍
[16:28:35] (CR) Mforns: [V: +2 C: +2] "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/803563 (owner: Joal)
[16:29:07] mforns: This patch of mine needs a following patch in Airflow as well (HQL filename change)
[16:29:14] We need to sync :)
[16:29:17] yes
[16:29:21] ack!
[16:31:07] I have another patch of the same HQL file which was merged yesterday.
[16:31:52] SandraEbele: Yes, I have sent that patch on top of it
[16:32:13] Okay.
[16:32:17] and I think we can use Sandra's current airflow patch to modify the corresponding DAG properties: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/74
[16:32:27] I added a new comment there, sandra
[16:33:03] Awesome mforns - We could also make that spark3 if you wish :)
[16:33:07] SandraEbele: --^
[16:33:14] Okay @mfo
[16:33:15] oh yea
[16:33:23] Okay @mforms
[16:38:06] mforns, aqu: I'll need you on this as well - https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/76
[17:02:57] Analytics, Data-Engineering, Event-Platform, Platform Engineering, tech-decision-forum: MediaWiki Event Carried State Transfer - Problem Statement - https://phabricator.wikimedia.org/T291120 (LNguyen)
[17:25:23] joal: left a comment on last MR
[17:26:20] mforns: your comment also applies to keytab, I'm not sure what the best option is. ottomata: do you have opinions here? https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/63#note_8034 I'd like to get this sorted so we can deploy
[17:31:22] yea sorry I wrote the comment to the wrong patch.
[18:14:34] joal: responded to your changes!
[18:31:22] Data-Engineering, Data-Engineering-Kanban, Event-Platform, Generated Data Platform: [Shared Event Platform] Ability to use Event Platform streams in Flink without boilerplate - https://phabricator.wikimedia.org/T308356 (lbowmaker) Open→Resolved
[19:20:47] (CR) Hoo man: [C: +2] Sum site_stats rows [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/803458 (https://phabricator.wikimedia.org/T310043) (owner: Lucas Werkmeister (WMDE))
[19:21:22] (CR) CI reject: [V: -1] Sum site_stats rows [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/803458 (https://phabricator.wikimedia.org/T310043) (owner: Lucas Werkmeister (WMDE))
[19:26:03] joal: approved, do you want me to merge?
[19:26:21] which one mforns ?
[19:26:39] joal: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/76
[19:26:57] Ah! the spark3 one - we need to wait mforns - the artifact has not yet been released
[19:27:23] hm, or we can merge if we don't plan on deploying before tomorrow
[19:27:31] But I'd rather wait and do it all at once
[19:27:39] ok, feel free to merge!
[19:27:42] Thanks mforns :)
[20:50:39] Data-Engineering, Data-Engineering-Kanban, Generated Data Platform: [Shared Event Platform] - Research Flink Changelog semantics to inform POC MW schema design - https://phabricator.wikimedia.org/T310082 (Ottomata) Okay, todays findings: - upsert_kafka connector also needs a primary key set, as well...
[23:15:28] (PS1) Hoo man: Update composer "require-dev" [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/803618
[23:16:00] (CR) Hoo man: [C: +2] "https://gerrit.wikimedia.org/r/c/analytics/wmde/scripts/+/803618 will make Jenkins happy again!" [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/803458 (https://phabricator.wikimedia.org/T310043) (owner: Lucas Werkmeister (WMDE))
[23:16:36] (CR) CI reject: [V: -1] Sum site_stats rows [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/803458 (https://phabricator.wikimedia.org/T310043) (owner: Lucas Werkmeister (WMDE))