[00:06:39] Analytics, Event-Platform, SRE, Sustainability (Incident Followup): Pool eventgate-main in both datacenters (active/active) - https://phabricator.wikimedia.org/T296699 (Legoktm)
[00:08:28] Analytics-EventLogging, Analytics-Radar, Growth-Team, Growth-Team-Filtering, and 5 others: Multiple MediaWiki hooks are not documented on mediawiki.org - https://phabricator.wikimedia.org/T157757 (Legoktm) Declined→Open @odimitrijevic why did you decline this?
[00:09:51] Analytics, Event-Platform, SRE, Sustainability (Incident Followup): Pool eventgate-main in both datacenters (active/active) - https://phabricator.wikimedia.org/T296699 (Legoktm) @dcausse the $kafka_reporting_topic variable is still in puppet (https://gerrit.wikimedia.org/g/operations/puppet/+/974...
[02:35:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[03:08:18] Analytics, Product-Analytics, Epic: Revamp analytics.wikimedia.org data portal & landing page - https://phabricator.wikimedia.org/T253393 (nshahquinn-wmf)
[05:13:08] PROBLEM - Check unit status of refinery-sqoop-mediawiki-production-daily on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-sqoop-mediawiki-production-daily https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:35:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[07:52:23] Analytics, Event-Platform, SRE, Sustainability (Incident Followup): Pool eventgate-main in both datacenters (active/active) - https://phabricator.wikimedia.org/T296699 (dcausse) >>! In T296699#7536195, @Legoktm wrote: > @dcausse the $kafka_reporting_topic variable is still in puppet (https://gerr...
[08:06:21] (PS1) Elukey: gobblin: move to the new canonical bundle jks location [analytics/refinery] - https://gerrit.wikimedia.org/r/742672 (https://phabricator.wikimedia.org/T296089)
[08:08:46] (PS2) Elukey: gobblin: move to the new canonical bundle jks location [analytics/refinery] - https://gerrit.wikimedia.org/r/742672 (https://phabricator.wikimedia.org/T296089)
[10:35:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[12:11:26] Data-Engineering, Data-Engineering-Kanban: Move spark.local.dir to /srv on stat100x - https://phabricator.wikimedia.org/T295346 (BTullis) This change was reverted because it caused immediate errors for users of Jupyter. The `spark2-shell` invocation worked although the following warning was displayed....
[12:19:25] Hi btullis - is now a good time for us to check spark-sql-servier?
[12:19:43] Data-Engineering, Data-Engineering-Kanban: Move spark.local.dir to /srv on stat100x - https://phabricator.wikimedia.org/T295346 (BTullis) The full output from `wmfdata.spark.get_session()` on an-test-client1001 is here: P17899 I have attempted to trigger by: * activating my stacked conda environment * r...
[12:19:56] Ho joal. Yes, let's. bc?
[12:20:17] sure btullis - joining!
[13:26:45] joal: o/ filed https://gerrit.wikimedia.org/r/c/analytics/refinery/+/742672 for Gobblin, it should be the last one :D
[13:29:08] (CR) Joal: [C: +1] "LGTM - About the puppet-management of that code, how fast do you think it should be done?" [analytics/refinery] - https://gerrit.wikimedia.org/r/742672 (https://phabricator.wikimedia.org/T296089) (owner: Elukey)
[13:34:23] (CR) Elukey: gobblin: move to the new canonical bundle jks location (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/742672 (https://phabricator.wikimedia.org/T296089) (owner: Elukey)
[14:06:27] milimetric: looks like the sqoop daily job failed
[14:06:28] Nov 30 05:00:13 an-launcher1002 kerberos-run-command[24920]: 2021-11-30T05:00:13 ERROR /wmf/data/raw/mediawiki_private/tables/discussiontools_subscription/snapshot=latest already exists in HDFS.
[14:07:05] based on this error, I think the correction is to add a "--force" to the sqoop command line
[14:07:48] (CR) Elukey: "To keep archives happy:" [analytics/refinery] - https://gerrit.wikimedia.org/r/742672 (https://phabricator.wikimedia.org/T296089) (owner: Elukey)
[14:08:03] (CR) Elukey: [C: +2] gobblin: move to the new canonical bundle jks location [analytics/refinery] - https://gerrit.wikimedia.org/r/742672 (https://phabricator.wikimedia.org/T296089) (owner: Elukey)
[14:08:05] Analytics, Event-Platform, Goal: Modern Event Platform: Stream Connectors - https://phabricator.wikimedia.org/T214430 (Ottomata) No, we never did it. This would have been Kafka Connect. Maybe now it would be based on Flink connectors? We could perhaps decline this and reopen or recreate if we ever...
[14:08:13] (CR) Elukey: [V: +2 C: +2] gobblin: move to the new canonical bundle jks location [analytics/refinery] - https://gerrit.wikimedia.org/r/742672 (https://phabricator.wikimedia.org/T296089) (owner: Elukey)
[14:08:58] is there a train etherpad that I can use to add https://gerrit.wikimedia.org/r/c/analytics/refinery/+/742672 ?
[14:09:02] (should be a no-op
[14:10:07] Analytics-Radar, MediaWiki-ContentHandler: Allow YAML as an alternative for JSON on MediaWiki pages - https://phabricator.wikimedia.org/T237136 (Ottomata) Open→Declined On wiki jsonschemas for eventlogging are no longer supported.
[14:12:03] elukey: https://etherpad.wikimedia.org/p/analytics-weekly-train
[14:14:12] ottomata: <3
[14:14:41] I had the wrong link, the one in my browser cache led to an empty page, I thought you had changed it :)
[14:14:56] elukey: (☞゚ヮ゚)☞
[14:19:45] Data-Engineering, Infrastructure-Foundations, SRE, Traffic-Icebox, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (BTullis) Tagging #data-engineering because we will likely be managing the Gobblin and/or Druid ingestion parts of this pipeline....
[14:29:24] Data-Engineering, Data-Engineering-Kanban: Move spark.local.dir to /srv on stat100x - https://phabricator.wikimedia.org/T295346 (BTullis) I checked the access the log from my jupyterhub-conda-singleuser service with `journalctl -u jupyter-btullis-singleuser.service -f` When running ` Nov 30 12:09:39 a...
[14:31:30] thanks ottomata ... hm... we gotta make it force overwrite or delete before running. Interesting
[14:32:32] ah yeah, --force
[14:33:34] Data-Engineering, Data-Engineering-Kanban: Move spark.local.dir to /srv on stat100x - https://phabricator.wikimedia.org/T295346 (BTullis) This looks like it is an effect of the `ReadWritePaths` setting for the systemd unit that creates the jupyterhub-conda-singleuser service. ` btullis@an-test-client1001...
[14:35:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[14:46:52] Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Move spark.local.dir to /srv on stat100x - https://phabricator.wikimedia.org/T295346 (BTullis) This [[https://gerrit.wikimedia.org/r/742732|patch]] looks like it should fix the issue with JupyterHub and `/srv/spark-tmp`.
[14:54:11] k, I have https://gerrit.wikimedia.org/r/c/operations/puppet/+/742734 to force overwrite for the daily sqoop job, but I'm waiting a bit to see if mneisler or ppelberg want to change the schedule as well
[14:59:21] Data-Engineering: Review druid deep-storage making sure that old segments having been reindexed are deleted - https://phabricator.wikimedia.org/T296207 (JAllemandou) First some details on how data storage works in Druid: - After being computed, segments are stored on HDFS (deep-storage) - Depending on load...
[15:40:42] Analytics-Radar, Data-Engineering, Event-Platform, Patch-For-Review: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064 (elukey) Stalled→In progress
[15:40:44] Analytics-Radar, Data-Engineering, Event-Platform, SRE, Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (elukey)
[16:22:29] Analytics-Radar, Data-Engineering, Event-Platform, Patch-For-Review: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064 (elukey) @Jgreen Hi! I am trying to move the Kafka Jumbo brokers TLS certs to the new PKI Intermediate CA dedicated to them, that will...
[16:50:14] Analytics-Radar, Data-Engineering, Event-Platform, Patch-For-Review: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064 (Jgreen) >>! In T296064#7537963, @elukey wrote: > @Jgreen Hi! I am trying to move the Kafka Jumbo brokers TLS certs to the new PKI Inte...
[16:51:47] (PS1) Milimetric: Link to AQS documentation instead of Research page [analytics/wikistats2] - https://gerrit.wikimedia.org/r/742764 (https://phabricator.wikimedia.org/T295298)
[17:12:43] Analytics, Analytics-Kanban, Data-Engineering-Kanban, Patch-For-Review: Purge gobblin files - https://phabricator.wikimedia.org/T287084 (JAllemandou) Open→Resolved
[17:12:47] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, and 2 others: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232 (JAllemandou)
[17:29:43] Analytics-Radar, Data-Engineering, Event-Platform, Patch-For-Review: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064 (elukey) >>! In T296064#7538028, @Jgreen wrote: >>>! In T296064#7537963, @elukey wrote: >> @Jgreen Hi! I am trying to move the Kafka Ju...
[17:30:51] Analytics-Radar, Data-Engineering, Event-Platform, SRE, Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (elukey)
[17:31:28] Analytics-Radar, Data-Engineering, Event-Platform, Patch-For-Review: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064 (elukey) In progress→Stalled Back to stalled, let's do it in January!
[17:39:23] Analytics-Radar, fundraising-tech-ops: puppetize CA changes for kafkatee on fundraising banner loggers - https://phabricator.wikimedia.org/T296765 (Jgreen)
[17:49:14] mforns: still there? got a sec to talk about artifact code stuff?
[17:49:17] (not conda envs :p )
[17:49:33] yesss! bc?
[17:49:36] ya
[18:04:11] Data-Engineering, Infrastructure-Foundations, SRE, Traffic-Icebox, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (ayounsi) @BTullis thanks! Real-time, would be a nice plus, but a hard requirement (unlike netflow). @cmooney [[ https://gerrit.w...
[18:10:50] Data-Engineering, Product-Analytics: [REQUEST] Notebook for testing with wmfdata - https://phabricator.wikimedia.org/T296420 (ldelench_wmf) a:nshahquinn-wmf
[18:11:15] Data-Engineering, Product-Analytics (Kanban): [REQUEST] Notebook for testing with wmfdata - https://phabricator.wikimedia.org/T296420 (ldelench_wmf)
[18:35:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[18:43:30] I'm seeing a strange superset issue when I try to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/740712/, looks like this:
[18:43:30] `Notice: /Stage[main]/Superset/Exec[init_superset]/returns: AttributeError: 'NoneType' object has no attribute 'auth_type`
[18:43:30] I'm going to troubleshoot that before proceeding with the deployment train. superset.wikimedia.org is still up so it's not urgent
[18:48:50] razzi: that's a strange error
[18:48:53] here to help if you need
[18:52:06] Cool wanna hop on a slack thing?
[18:52:48] or I can just typitty type here
[18:55:45] razzi lets typitty type, which node?
[18:55:51] superset prod
[18:55:51] an-tool1005?
[18:55:54] an-tool1010
[18:56:14] though you make a good point, I should have deployed to 1005 first
[18:57:08] I thought a change as simple as https://gerrit.wikimedia.org/r/c/operations/puppet/+/740712/ would go cleanly; with that being said I don't think the issue is coming from my patch
[18:57:09] hmm, i think it shouldn't be running init_superset.sh
[18:57:14] yeah
[18:57:25] Ahh I see the problem
[18:57:30] WIKIMEDIA_SUPERSET_TIMEOUT = int(timedelta(minutes=3).total_seconds())
[18:57:36] but I didn't import timedelta
[18:57:39] ah! :)
[18:57:59] Just for practice, let me revert the whole patch
[18:58:00] superset_database_exists.py is failing because of a different thing
[18:58:02] rather than rolling forward
[18:59:00] https://gerrit.wikimedia.org/r/c/operations/puppet/+/742564 is the revert, do we usually wait for reviews for reverts ?
[18:59:08] naw go for it
[18:59:11] kk
[18:59:44] I imagine superset_database_exists.py is also failing due to config not loading due to missing import
[19:01:31] yup i think thats why
[19:01:34] and because it fails the init is trying to run
[19:14:09] Analytics, Event-Platform, Goal: Modern Event Platform: Stream Connectors - https://phabricator.wikimedia.org/T214430 (Milimetric) It feels like a major part of the Event Platform, and certainly present in new diagrams other teams are drawing up. I think it should stick around and we should collabor...
[19:18:33] Analytics, CheckUser, Patch-For-Review, Platform Team Workboards (Clinic Duty Team), Schema-change: Schema changes for `cu_changes` and `cu_log` table - https://phabricator.wikimedia.org/T233004 (Milimetric) We'll definitely need to change some plumbing if this changes, and it's been parked f...
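A minimal sketch of the fix discussed at 18:57, assuming the setting lives in a Python config module (the file itself is not shown in this log): `timedelta` has to be imported before the assignment can be evaluated.

```python
# Sketch only: the surrounding Superset config file is not shown in the log.
from datetime import timedelta  # the import that was reported missing

# Three minutes, expressed as a whole number of seconds.
WIKIMEDIA_SUPERSET_TIMEOUT = int(timedelta(minutes=3).total_seconds())
```

Without the import, evaluating that line raises a `NameError`, which would fit the guess at 18:59 that the config was not loading; the log does not confirm the exact failure path.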
[19:32:23] Analytics-Radar, Product-Analytics: Develop a consistent rule for which special pages count as pageviews - https://phabricator.wikimedia.org/T240676 (kzimmerman) a:Iflorez→None We discussed this again, and think it should be considered if/when we revisit how we measure pageviews. But again, it is...
[19:45:38] mforns: you ok if i merge wmf_airflow_lib branch into main?
[19:45:59] ottomata: yes
[19:46:30] https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/4
[19:47:06] ottomata: I approved, do you want me to merge?
[19:47:14] do we squash?
[19:47:24] would prefer no squash on this one, this is lots o commits
[19:47:29] i can merge
[19:47:49] k
[20:58:50] Analytics, Event-Platform, SRE, Sustainability (Incident Followup): Pool eventgate-main in both datacenters (active/active) - https://phabricator.wikimedia.org/T296699 (Legoktm) Ack, thanks! I sent a heads-up to the ops list about this just in case some other application has started assuming it...
[21:46:13] I'm going to roll forward the superset timeout change with the fix
[21:57:53] hey folks! Would someone be able to take a look at the error we're getting in this Superset chart? https://superset.wikimedia.org/r/903 We haven't had this issue with the same chart before or similar queries run on this data set. Would be happy to log a phab but we're trying to do some active monitoring related to the Fundraising campaign, so an extra set of eyes would be most helpful!
[21:58:28] Sure thing eyener , I'll take a look
[21:58:38] Looks like `Error: {'message': 'SQL array indices start at 1', 'errorCode': 7, 'errorName': 'INVALID_FUNCTION_ARGUMENT'` ?
[21:58:44] razzi: ty!
[21:58:51] yup that's the one
[21:59:18] it seems to be bugging out related to the `campaign` filter attached to the query
[21:59:47] the table chart, unaggregated, and selecting just the `campaign` seems to work okay
[21:59:58] hm ok
[22:02:39] eyener: just looking at the query I don't see any sql arrays...
[22:02:44] Let me see what the sql is that it's running
[22:04:23] I put it in a paste: https://phabricator.wikimedia.org/P17913
[22:04:51] `access denied`
[22:05:47] Hmm strange
[22:06:15] I suppose you don't have WMF-NDA on your phabricator account eyener ?
[22:06:24] I should....
[22:06:30] Anyways I think the issue is here: ` AND split_part(event.l[cardinality(event.l)],'|', 4) = '6'`
[22:06:47] if cardinality(event.l) is 0, it would complain
[22:07:26] What is the `AND ... = '6'` part doing?
[22:07:36] I did stick a cardinality(event.l) > 0 filter on there with no luck
[22:08:05] there is a calculated column for it, `event_l_card`
[22:08:46] here's the link to the sqllab with the full query eyener https://superset.wikimedia.org/superset/sqllab?savedQueryId=394
[22:08:53] the `AND.....=6` clause grabs only banners with a status code of 'shown' (as opposed to 'closed' etc)
[22:08:54] (hopefully you can open that one)
[22:16:23] still there eyener ? I'm getting closer I think
[22:16:34] yup still here and same :) hehe
[22:16:51] cool cool
[22:17:01] so I see "event.l" is a sort of array
[22:17:33] Looking like `B2021_...|C2021_...|1606...|6`
[22:17:56] yup
[22:18:08] so that's each array element
[22:18:20] Which array element do you want to use for the filter?
[22:19:33] eyener: ^
[22:19:52] many of them. the last element (last_event_status_code, that mess of `split_part(event.l[cardinality(event.l)],'|', 4) ` should be 6 to indicate the status of 'shown'
[22:20:07] the `campaign` element should be C2021.....
[22:20:29] Analytics-EventLogging, Analytics-Radar, Growth-Team, Growth-Team-Filtering, and 5 others: Multiple MediaWiki hooks are not documented on mediawiki.org - https://phabricator.wikimedia.org/T157757 (odimitrijevic) @Legoktm please feel free to reopen if you think this is still relevant. While I clos...
[22:20:32] gotcha, so cardinality will give the number of elements, then event.l[cardinality(...)] gives the last one ?
[22:20:48] Analytics-Radar, Growth-Team, Growth-Team-Filtering, MediaWiki-ContentHandler, and 4 others: Multiple MediaWiki hooks are not documented on mediawiki.org - https://phabricator.wikimedia.org/T157757 (odimitrijevic)
[22:20:55] it's a "parse and then filter" exercise so, yup!
[22:21:38] ok so that makes sense now, and now I can see the problem is excluding empty event.l values
[22:21:56] razzi: for reference here is a very similar chart & query that executes without issue https://superset.wikimedia.org/r/907
[22:22:35] I thought so, too! But even with the event.l cardinality filter (https://superset.wikimedia.org/r/908) it is having issue...
[22:23:51] eyener: I guess the filter order is significant; try dragging event_l_card > 0 to the top of the filters
[22:24:05] interesting!
[22:25:11] Hey! Look at that! It works, razzi!
[22:25:17] https://superset.wikimedia.org/r/909
[22:25:48] not at the bottom, but toward the top :) Awesome. Thank you for your help, and for the eyes on this on such short notice.
[22:25:49] Yup I see the chart :)
[22:26:05] np! best of luck with all your reporting
[22:26:07] wish I knew why but ... that's awesome!!
[22:26:26] In my understanding, the filters apply in the order they are "written"
[22:26:42] so if a higher up filter accesses event.l[0] it will error
[22:26:52] even if a later filter would have left that row out entirely
[22:26:58] it's already errored, game over! haha
[22:28:22] hahah fair enough!
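The filter-order explanation given at 22:26 can be illustrated with a small Python sketch. It is only an analogy: Python's `and` evaluates left to right and stops early, which stands in for the behaviour razzi describes; the log does not verify how the query engine actually orders the filters. The helper names and the sample `l` values are hypothetical, modelled on the `B2021_...|C2021_...|1606...|6` shape quoted above.

```python
def element_at(arr, index):
    """Mimic 1-based SQL array subscripting: index 0 is an error."""
    if index < 1 or index > len(arr):
        raise ValueError("SQL array indices start at 1")
    return arr[index - 1]


def last_status_is_shown(row):
    # Mirrors split_part(event.l[cardinality(event.l)], '|', 4) = '6'
    last = element_at(row["l"], len(row["l"]))
    return last.split("|")[3] == "6"


def has_events(row):
    # Mirrors the event_l_card > 0 filter
    return len(row["l"]) > 0


# Made-up rows: one shown banner, one empty array, one non-shown banner.
rows = [
    {"l": ["B2021_a|C2021_a|1606|6"]},
    {"l": []},
    {"l": ["B2021_b|C2021_b|1606|1"]},
]

# Guard first: the empty row is dropped before the subscript is evaluated.
guarded = [r for r in rows if has_events(r) and last_status_is_shown(r)]

# Subscript first: the empty row raises before the guard can exclude it.
try:
    unguarded = [r for r in rows if last_status_is_shown(r) and has_events(r)]
except ValueError as err:
    unguarded_error = str(err)
```

With the guard first, only the first row survives; with the subscript first, the empty array triggers the same "indices start at 1" complaint seen in the chart.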
[22:32:27] Analytics-EventLogging, Analytics-Kanban, Event-Platform, Goal, and 2 others: Modern Event Platform - https://phabricator.wikimedia.org/T185233 (odimitrijevic)
[22:33:20] Analytics, Event-Platform, Goal: Modern Event Platform: Stream Connectors - https://phabricator.wikimedia.org/T214430 (odimitrijevic) Open→Declined I am going to decline this task as is. Let's define new stories as part of the new Event Platform work.
[22:35:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[22:35:50] razzi one more question for you - what is the time lag on reporting for the event stream and for druid.pageviews_[daily][hourly]?
[22:36:43] that is a good question eyener that I don't know off the top of my head
[22:37:10] 2 questions I guess
[22:37:58] If you're interested in the current lag, you can ask the database: select(max(dt)) or something
[22:38:31] if you'd like to know the typical lag, I'd have to see when the timers or whatever writes the updates are configured to run
[22:42:12] Analytics, Event-Platform, SRE, Sustainability (Incident Followup): Pool eventgate-main in both datacenters (active/active) - https://phabricator.wikimedia.org/T296699 (Ottomata) +1
[22:42:28] will do :)
[22:42:29] eyener: Looking at the config, I can see the following
[22:42:33] https://gitlab.wikimedia.org/btullis/puppet/-/blob/production/modules/profile/manifests/analytics/refinery/job/eventlogging_to_druid_job.pp#L165
[22:42:47] hourly runs `interval => '*-*-* *:00:00',` the start of every hour
[22:43:05] daily runs ` interval => '*-*-* 00:00:00',` every day at midnight (utc)
[22:43:32] so that's how the data gets in to druid
[22:53:04] cool
[22:53:18] and I'll ask the table, too, I was just taking shortcuts :)
[23:25:17] Data-Engineering, Product-Analytics (Kanban): [REQUEST] Notebook for testing with wmfdata - https://phabricator.wikimedia.org/T296420 (nshahquinn-wmf)
[23:25:50] Data-Engineering, wmfdata-python, Product-Analytics (Kanban): Create a wmfdata-python test script - https://phabricator.wikimedia.org/T247261 (nshahquinn-wmf)
[23:27:03] Data-Engineering, Product-Analytics (Kanban): [REQUEST] Notebook for testing with wmfdata - https://phabricator.wikimedia.org/T296420 (nshahquinn-wmf) I already have a open pull request for this in the wmfdata-python repo (although it hasn't been touched in about 10 months). I'll work on getting that fin...
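As a rough companion to the timer intervals quoted at 22:42 and 22:43 (`*-*-* *:00:00` hourly, `*-*-* 00:00:00` daily), the sketch below computes when each schedule would next fire after a given moment. The function names are hypothetical, the datetimes are naive and treated as UTC, and this only models the trigger time: it says nothing about which time window a run loads or how long a run takes, so it does not by itself answer the lag question.

```python
from datetime import datetime, timedelta


def next_hourly_run(now):
    """Next trigger of '*-*-* *:00:00' strictly after `now`."""
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    return top_of_hour + timedelta(hours=1)


def next_daily_run(now):
    """Next trigger of '*-*-* 00:00:00' strictly after `now`."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + timedelta(days=1)


# Example moment taken from the log timestamps (naive, treated as UTC).
now = datetime(2021, 11, 30, 22, 43, 5)
print(next_hourly_run(now))
print(next_daily_run(now))
```

For the actual freshness of the data, the `max(dt)` query suggested at 22:37 remains the direct check.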