[01:18:39] wouldn't newcomers need to get to horizon too? I'm not sure which words exactly to link to it, but it should be on the page somewhere in my opinion. Unless we're actively trying to prevent people from getting there or something similar (excuse my ignorance). The only thing that mentions horizon is a link to add a member as far as I can tell.
[02:00:01] Data-Engineering-Planning, DC-Ops, SRE, Shared-Data-Infrastructure, ops-eqiad: Q2:rack/setup/install cephosd100[1-5] - https://phabricator.wikimedia.org/T322760 (Cmjohnson)
[02:27:07] Data-Engineering-Planning, DC-Ops, SRE, Shared-Data-Infrastructure, ops-eqiad: Q2:rack/setup/install an-coord100[3,4] & an-mariadb100[1,2] - https://phabricator.wikimedia.org/T321119 (Cmjohnson)
[06:12:14] Data-Engineering, Product-Analytics, Wmfdata-Python: Wmfdata-Python triggers a Pandas warning during mariadb.run - https://phabricator.wikimedia.org/T324135 (nshahquinn-wmf)
[06:42:02] Data-Engineering-Planning, Data Pipelines, Pageviews-Anomaly, Product-Analytics: Analyze possible bot traffic for frwiki article Cookie (informatique) - https://phabricator.wikimedia.org/T313114 (Ladsgroup) a:Ladsgroup I'm going to heavily throttle page views to that article coming from the b...
[08:23:22] Data-Engineering-Planning, Data Pipelines, Pageviews-Anomaly, Product-Analytics: Analyze possible bot traffic for frwiki article Cookie (informatique) - https://phabricator.wikimedia.org/T313114 (ayounsi) @Ladsgroup rate limiting or blocking at the edge should be used only for traffic putting the...
[08:28:40] Data-Engineering-Planning, Data Pipelines, Pageviews-Anomaly, Product-Analytics: Analyze possible bot traffic for frwiki article Cookie (informatique) - https://phabricator.wikimedia.org/T313114 (Ladsgroup) a:Ladsgroup→None Yes. Makes sense. I won't do this.
[10:48:42] hello folks
[10:48:47] stat1004 seems a little in trouble
[10:49:50] there is a huge java process that Dan is running, the main trouble is not cpu but simple things like df hangs
[10:49:58] joal: (and/or anyone) quick hypothetical question - if we were to begin using an S3 client library for our jobs today, which would we likely be using? I'm researching how it/they would handle retries.
[10:50:12] elukey: Thanks. I will take a look as well.
[10:50:57] elukey: Looks like hardware related, according to `dmesg -T`
[10:51:45] btullis: yeah I was about to say, sata issues
[10:51:46] sigh
[10:52:21] That host is scheduled to be decommissioned anyway, I believe. Maybe this will accelerate the process.
[10:52:51] poor stat1004! So many years of duty :)
[10:53:03] I always ssh to it, I'll have to rewrite my brain's habits :D
[10:53:45] or maybe it is just one disk failed in the raid
[10:56:00] I'll raise a ticket. At least we have both stat1009 and stat1010 insetup, ready to go.
[10:56:57] You can just make a special alias in your .ssh/config so that stat1004 goes to stat1009 instead :-)
[10:57:13] btullis: it is not the sameeeee I have feelings for those hosts
[10:57:14] :D
[10:57:26] ah also do we have 1009?
[10:57:49] nice TIL
[11:07:59] elukey: sorry, I
[11:08:22] I am indeed running a spark-sql job, but why is it hammering stat1004... it should be doing all the work in executors...
[11:08:40] milimetric: nono it is not that, probably a failed disk
[11:08:41] :)
[11:08:56] I did notice the fuse mount is broken on there
[11:09:23] (which I have a talent for breaking, btw, 'cause I use it heavily sometimes, like tree /some/hdfs/path)
[11:10:01] You might both be able to answer the question above - which S3 libraries would we use if we were to start migrating jobs today?
[11:10:28] I can try to fix the fuse mount for you.
[11:15:34] btullis: do you have a specific use case in mind? With jobs you mean airflow/oozie/spark?
[11:18:05] elukey: Yes, any job which currently reads from or writes to HDFS. If an option of using S3 were available tomorrow, is there an obvious choice of S3 client library (or libraries) that you (or your team) would select first?
[11:20:03] btullis: it probably depends on the language, for python boto3 is surely an option, and for spark there should be a ton of libs to access s3.. Other Map-Reduce jobs in java may follow the same
[11:20:39] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Add an-presto10[06-15] to the presto cluster - https://phabricator.wikimedia.org/T323783 (Stevemunene) We do not have the right build for bullseye, thus we need to upgrade the packages for that. Here is a snippet from the...
[11:22:46] elukey: Thanks. For context, I'm researching the impact of different load-balancer choices (anycast, LVS, haproxy etc.) on the upcoming Ceph build and specifically its S3 component.
[11:24:20] If we're in the middle of transferring a 10 GB file over S3 and a TCP connection from a client is interrupted (e.g. by taking a server out of the LB system), is that client capable of restarting from where it left off?
[12:02:40] Analytics-Clusters, Analytics-Radar, SRE, serviceops: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (LSobanski)
[12:11:34] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Add an-presto10[06-15] to the presto cluster - https://phabricator.wikimedia.org/T323783 (BTullis) I think that we're in luck here, because the presto debs that we created are not compiled for a specific operating system....
[12:18:15] I've never worked with S3, the cloud revolution happened entirely while I was here riding physical hardware like some dirty cowboy
[12:21:31] :-) Well now, we'd best sling a rope around that pesky protocol and make it do our bidding.
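A rough illustration of the "restart from where it left off" question asked at 11:24:20: one common approach is for the client to re-request only the missing byte range after an interruption. The sketch below uses only the standard library and a simulated source; `ConnectionDropped`, `resumable_download` and `make_flaky_source` are hypothetical names invented for this example and are not part of boto3 or any other S3 client.

```python
from typing import Callable, Dict, Tuple


class ConnectionDropped(Exception):
    """Hypothetical error: the transfer was cut after `partial` bytes arrived."""

    def __init__(self, partial: bytes):
        super().__init__("connection dropped mid-transfer")
        self.partial = partial


def resumable_download(
    fetch_range: Callable[[int, int], bytes],
    total_size: int,
    max_attempts: int = 5,
) -> bytes:
    """Fetch `total_size` bytes, resuming from the last received offset.

    `fetch_range(start, end)` returns bytes start..end (inclusive), in the
    style of an HTTP Range request, or raises ConnectionDropped.
    """
    received = bytearray()
    failures = 0
    while len(received) < total_size:
        try:
            chunk = fetch_range(len(received), total_size - 1)
            dropped = False
        except ConnectionDropped as exc:
            # Keep whatever arrived before the interruption.
            chunk, dropped = exc.partial, True
        if dropped or not chunk:
            failures += 1
            if failures >= max_attempts:
                raise RuntimeError("giving up after repeated interrupted transfers")
        received.extend(chunk)
    return bytes(received)


def make_flaky_source(
    data: bytes, fail_after: int
) -> Tuple[Callable[[int, int], bytes], Dict[str, int]]:
    """Simulated backend whose first request dies after `fail_after` bytes."""
    calls = {"n": 0}

    def fetch(start: int, end: int) -> bytes:
        calls["n"] += 1
        wanted = data[start : end + 1]
        if calls["n"] == 1:
            raise ConnectionDropped(wanted[:fail_after])
        return wanted

    return fetch, calls
```

With boto3, `fetch_range` could plausibly wrap `get_object(Bucket=..., Key=..., Range=f"bytes={start}-{end}")` and read the returned body. Whether a given client resumes a partially read body by itself, rather than only retrying failed API calls, is an assumption that would need testing against the real load-balancer setup.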
[12:22:18] Data-Engineering-Planning, Data Pipelines, Data-Catalog: Spike: Integrate Spark with DataHub - https://phabricator.wikimedia.org/T306896 (BTullis)
[12:24:57] PROBLEM - Check systemd state on an-presto1006 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:27:31] ACKNOWLEDGEMENT - Check systemd state on an-presto1006 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service Btullis T323783 - host being brought into service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:44:21] elukey: milimetric: FYI Initial research into the boto3 S3 client retry behaviour looks good. https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html#standard-retry-mode
[12:57:17] Cool
[13:08:41] Data-Engineering: Requesting Kerberos identity for mlitn - https://phabricator.wikimedia.org/T324203 (matthiasmullie)
[13:24:20] Analytics-Clusters, Analytics-Radar, Data-Engineering, Event-Platform Value Stream, and 2 others: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (Ottomata)
[13:25:43] btullis: you should ask dcausse on search team, i know they had to think about s3 clients at some point (for swift?)
[13:26:17] maybe sort of relevant? https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/filesystems/s3/#hadooppresto-s3-file-systems-plugins
[14:12:48] Data-Engineering-Planning: Requesting Kerberos identity for mlitn - https://phabricator.wikimedia.org/T324203 (EChetty)
[14:12:51] Data-Engineering-Planning, Event-Platform Value Stream: Flink Tables should have a default ROWTIME column.
- https://phabricator.wikimedia.org/T324144 (EChetty)
[14:12:57] Data-Engineering-Planning, Product-Analytics, Wmfdata-Python: Wmfdata-Python triggers a Pandas warning during mariadb.run - https://phabricator.wikimedia.org/T324135 (EChetty)
[14:12:59] Data-Engineering-Planning, Event-Platform Value Stream: Flink + Event Platform integration for writing into streams via Table API - https://phabricator.wikimedia.org/T324114 (EChetty)
[14:13:01] Data-Engineering-Planning, Event-Platform Value Stream: [SPIKE] Use Flink for batch backfilling - https://phabricator.wikimedia.org/T324108 (EChetty)
[14:13:03] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05): Deploy Mediawiki Stream Enrichment on an-launcher1002. - https://phabricator.wikimedia.org/T323914 (EChetty)
[14:13:05] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05): Spark Streaming Dumps POC: Update iceberg tables - https://phabricator.wikimedia.org/T323645 (EChetty)
[14:13:07] Data-Engineering-Planning: NEW FEATURE REQUEST: Dataset with active and non-active Wikis - https://phabricator.wikimedia.org/T323662 (EChetty)
[14:13:09] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05): Spark Streaming Dumps POC: Backfill metadata table - https://phabricator.wikimedia.org/T323642 (EChetty)
[14:13:11] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05): Spark Streaming Dumps POC: Backfill content table - https://phabricator.wikimedia.org/T323641 (EChetty)
[14:13:13] Analytics, Analytics-Wikistats, Data-Engineering-Planning: Anonymous edits - https://phabricator.wikimedia.org/T323562 (EChetty)
[14:13:15] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05): [SPIKE] Evaluate a pyflink version of Mediawiki Stream Enrichment - https://phabricator.wikimedia.org/T323217 (EChetty)
[14:13:17] Data-Engineering-Planning, Event-Platform Value Stream, MW-1.40-notes (1.40.0-wmf.12;
2022-11-28), User-brennen, Wikimedia-production-error: EventBus: Error: Call to a member function isCurrent() on null - https://phabricator.wikimedia.org/T323294 (EChetty)
[14:13:21] Data-Engineering-Planning, Event-Platform Value Stream, MW-1.40-notes (1.40.0-wmf.8; 2022-10-31): EventBus' stream config destination_event_service setting should move into producers.mediawikI_eventbus specific settings. - https://phabricator.wikimedia.org/T321557 (EChetty)
[14:13:23] Data-Engineering-Planning, Product-Analytics, Wmfdata-Python: Remove Matplotlib as a Wmfdata-Python dependency - https://phabricator.wikimedia.org/T324053 (EChetty)
[14:13:26] Data-Engineering-Planning, Editing-team, WMF-General-or-Unknown, Wikimedia-production-error: "Invalid revision ID -1" error for VisualEditorFeatureUse events, mostly from officewiki - https://phabricator.wikimedia.org/T322602 (EChetty)
[14:13:28] Data-Engineering-Planning, Patch-For-Review, Product-Analytics (Kanban): Add mediawiki_web_ab_test_enrollment to the allowlist - https://phabricator.wikimedia.org/T323664 (EChetty)
[14:13:30] Analytics-Clusters, Analytics-Radar, Data-Engineering-Planning, Event-Platform Value Stream, and 2 others: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (EChetty)
[14:13:46] Data-Engineering-Planning, Shared-Data-Infrastructure: Requesting Kerberos identity for mlitn - https://phabricator.wikimedia.org/T324203 (EChetty)
[14:15:08] Data-Engineering-Planning, Data Pipelines, Patch-For-Review, Product-Analytics (Kanban): Add mediawiki_web_ab_test_enrollment to the allowlist - https://phabricator.wikimedia.org/T323664 (EChetty)
[14:15:14] Data-Engineering-Planning, Data Pipelines: NEW FEATURE REQUEST: Dataset with active and non-active Wikis - https://phabricator.wikimedia.org/T323662 (EChetty)
[14:17:09] Data-Engineering-Planning,
Event-Platform Value Stream, Shared-Data-Infrastructure: Cleanup User Hive Databases - https://phabricator.wikimedia.org/T323884 (EChetty)
[14:17:18] Data-Engineering-Planning, Event-Platform Value Stream, Shared-Data-Infrastructure: Cleanup User Hive Databases - https://phabricator.wikimedia.org/T323884 (EChetty)
[14:19:20] Data-Engineering-Planning: Check home/HDFS leftovers of faidon - https://phabricator.wikimedia.org/T322107 (EChetty)
[14:19:24] Data-Engineering-Planning: Check home/HDFS leftovers of bmansurov - https://phabricator.wikimedia.org/T320367 (EChetty)
[14:19:28] Data-Engineering-Planning: Check home/HDFS leftovers of jmads - https://phabricator.wikimedia.org/T319266 (EChetty)
[14:19:32] Data-Engineering-Planning, Event-Platform Value Stream, Shared-Data-Infrastructure: Cleanup User Hive Databases - https://phabricator.wikimedia.org/T323884 (EChetty)
[14:19:36] Data-Engineering-Planning: Check home/HDFS leftovers of eyener - https://phabricator.wikimedia.org/T316072 (EChetty)
[14:19:40] Analytics, Data-Engineering, Data-Engineering-Kanban: Check home/HDFS leftovers of bumeh-ctr - https://phabricator.wikimedia.org/T300607 (EChetty)
[14:19:50] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Requesting Kerberos identity for mlitn - https://phabricator.wikimedia.org/T324203 (EChetty)
[14:19:58] Data-Engineering-Planning, Event-Platform Value Stream, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Cleanup User Hive Databases - https://phabricator.wikimedia.org/T323884 (EChetty)
[14:20:22] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Cleanup User Hive Databases - https://phabricator.wikimedia.org/T323884 (EChetty) p:Triage→Medium
[14:20:40] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Requesting Kerberos identity for mlitn -
https://phabricator.wikimedia.org/T324203 (EChetty) p:Triage→High
[14:23:00] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Requesting Kerberos identity for mlitn - https://phabricator.wikimedia.org/T324203 (EChetty)
[15:15:09] Analytics, AQS 2.0 Roadmap, API Platform (Sprint 02), Epic, and 2 others: AQS 2.0 documentation - https://phabricator.wikimedia.org/T288664 (JArguello-WMF)
[16:03:33] (CR) Ottomata: Add ios talk page interaction schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/857759 (https://phabricator.wikimedia.org/T321841) (owner: Mazevedo)
[16:53:01] Data-Engineering-Planning, Event-Platform Value Stream, Shared-Data-Infrastructure: [SPIKE] Deploy event driven stateless Flink service to DSE cluster - https://phabricator.wikimedia.org/T320812 (Ottomata) @bking something I noticed about the upstream helm charts that I think we'd want to change, is...
[17:29:14] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05): Spark Streaming Dumps POC: Backfill content table - https://phabricator.wikimedia.org/T323641 (Milimetric) a:MunizaA
[17:31:18] (CR) Joal: "This is super great Antoine - I agree with Marcel's comments (thank you so much for the review :).
I added 3 small things - This is then " [analytics/refinery/source] - https://gerrit.wikimedia.org/r/852315 (https://phabricator.wikimedia.org/T321168) (owner: Aqu)
[17:35:34] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05): Spark Streaming Dumps POC: Backfill metadata table - https://phabricator.wikimedia.org/T323642 (Milimetric) a:Milimetric
[17:35:47] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05): Spark Streaming Dumps POC: Backfill metadata table - https://phabricator.wikimedia.org/T323642 (Milimetric)
[18:44:13] (VarnishkafkaNoMessages) firing: varnishkafka on cp5021 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp5021%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[18:49:13] (VarnishkafkaNoMessages) resolved: varnishkafka on cp5021 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp5021%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[18:56:42] Data-Engineering, Cassandra, Epic, Platform Team Workboards (Platform Engineering Reliability): Cassandra3 migration for Analytics AQS - https://phabricator.wikimedia.org/T249755 (Eevans)
[20:31:18] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05): [SPIKE] Evaluate a pyflink version of Mediawiki Stream Enrichment - https://phabricator.wikimedia.org/T323217 (gmodena) A pyflink implementation of Mediawiki Stream Enrichment has been developed and deployed on YARN. While this implement...
[20:32:54] Analytics-Jupyter, Data-Engineering, Product-Analytics: Replace anaconda-wmf with smaller, non-stacked Conda environments - https://phabricator.wikimedia.org/T302819 (xcollazo)
[20:32:56] Analytics-Jupyter, Data-Engineering-Planning, Product-Analytics, Data Pipelines (Sprint 05-06): Add support for jupyterlab on conda-analytics - https://phabricator.wikimedia.org/T321088 (xcollazo) In progress→Resolved Haven't received any bugs feedback yet, that's good! Closing!
[21:25:44] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Add an-presto10[06-15] to the presto cluster - https://phabricator.wikimedia.org/T323783 (Stevemunene) Presto-service run fails on the Debian11 boxes due to a python issue caused by the unversioned ` /usr/bin/python ` req...