[00:33:58] RECOVERY - SSH on an-coord1002.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:37:31] Data-Engineering, Wmfdata-Python, Product-Analytics (Kanban): Release Wmfdata-Python 2.0 - https://phabricator.wikimedia.org/T300442 (nshahquinn-wmf)
[00:38:39] Data-Engineering, Wmfdata-Python, Product-Analytics (Kanban): Release Wmfdata-Python 2.0 - https://phabricator.wikimedia.org/T300442 (nshahquinn-wmf) The bulk of the removals are up for review in https://github.com/wikimedia/wmfdata-python/pull/35, although I just realized I still need to update the...
[03:15:12] (VarnishkafkaNoMessages) firing: (3) varnishkafka on cp1081 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[03:15:12] (VarnishkafkaNoMessages) firing: (3) varnishkafka on cp2033 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[03:20:12] (VarnishkafkaNoMessages) resolved: (3) varnishkafka on cp2033 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[03:20:12] (VarnishkafkaNoMessages) resolved: (11) varnishkafka on cp1081 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[04:47:40] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 04), Spike: Easy Flink Python UDF + SQL enrichment - https://phabricator.wikimedia.org/T320968 (tchin) So I was using Kafka Client 3.2.3, but I noticed you were using 2.4.1. Switched to that and it solves the cluster authorization issue....
[05:37:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp1081 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp1081%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[05:37:12] (VarnishkafkaNoMessages) firing: (3) varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[05:42:12] (VarnishkafkaNoMessages) resolved: (6) varnishkafka on cp1081 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[05:42:12] (VarnishkafkaNoMessages) resolved: (4) varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[08:00:17] Hello btullis, we are regularly receiving some alerts about "SSH on an-coord1002.mgmt". You've already explained to me what those mgmt hosts are, so I think it's not urgent. Anyway, this message is in case you missed them.
[08:37:33] Analytics-Radar, Domains, SRE, Traffic-Icebox, and 2 others: Don't set cookies in traffic layer for non-user facing domains (avoid false third-party cookie warning) - https://phabricator.wikimedia.org/T262996 (hashar)
[08:37:43] Analytics-Radar, Domains, SRE, Traffic-Icebox, and 2 others: Don't set cookies in traffic layer for non-user facing domains (avoid false third-party cookie warning) - https://phabricator.wikimedia.org/T262996 (hashar) Gerrit kept reporting `org.apache.http.client.protocol.ResponseProcessCookies :...
[08:54:36] Data-Engineering, Product-Analytics, Wmfdata-Python: Remove Spark session timeout functionality from Wmfdata-Python - https://phabricator.wikimedia.org/T298179 (Antoine_Quhen) If the timeout is removed, it could be possible to detect and alert when `not production` yarn applications are running for m...
[09:32:13] aqu: Thanks for the heads-up. I think I've found a way to stop these alerts going to the whole data engineering team. I'll look at it now.
[10:40:33] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Add spark and spark-operator images to operations/docker-images/production-images - https://phabricator.wikimedia.org/T318730 (EChetty)
[10:45:17] Data-Engineering-Planning, DBA, Data-Services, Data Pipelines (Sprint 04), cloud-services-team (Kanban): Prepare and check storage layer for bnwikiquote - https://phabricator.wikimedia.org/T319190 (EChetty) a:BTullis
[11:00:26] Data-Engineering-Planning, Data Pipelines: 503 on Superset (reproducible) - https://phabricator.wikimedia.org/T322525 (EChetty) > What would be the recommended way to query it with Spark? From looking across some articles, I get the impression that this would be Jupyter? @Michael Yup. So the suggested...
[11:28:44] Data-Engineering-Planning, Data Pipelines: 503 on Superset (reproducible) - https://phabricator.wikimedia.org/T322525 (EChetty)
[11:29:13] Data-Engineering-Planning, Data Pipelines: 503 on Superset (reproducible) - https://phabricator.wikimedia.org/T322525 (EChetty)
[12:20:10] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Add spark and spark-operator images to operations/docker-images/production-images - https://phabricator.wikimedia.org/T318730 (BTullis) Open→Resolved
[12:20:15] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Add spark and spark-operator images to operations/docker-images/production-images - https://phabricator.wikimedia.org/T318730 (BTullis) I'm happy with the latest build of these. We know that the `gen...
[12:20:37] Data-Engineering-Planning, Shared-Data-Infrastructure, Event-Platform Value Stream (Sprint 04): [SPIKE] Deploy event driven stateless Flink service to DSE cluster - https://phabricator.wikimedia.org/T320812 (BTullis)
[12:47:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp1075 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp1075%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[12:52:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp1075 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp1075%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[13:18:14] Data-Engineering-Planning, DBA, Data-Services, Data Pipelines (Sprint 04), cloud-services-team (Kanban): Prepare and check storage layer for bnwikiquote - https://phabricator.wikimedia.org/T319190 (EChetty)
[14:05:43] Data-Engineering, Equity-Landscape: Grants input metric - https://phabricator.wikimedia.org/T309276 (ntsako) Hi @KCVelaga_WMF, I have updated the query to use the proper columns from the grants spreadsheet. ` -- Updated query -- Please use Hue instead WITH total_grants AS ( SELECT country_code,...
[14:14:35] Data-Engineering, Data-Engineering-Kanban, Shared-Data-Infrastructure, Patch-For-Review: Fix turnilo after upgrade - https://phabricator.wikimedia.org/T308778 (EChetty)
[14:16:54] Data-Engineering, Data-Engineering-Kanban, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Fix turnilo after upgrade - https://phabricator.wikimedia.org/T308778 (EChetty)
[14:25:09] Data-Engineering, Event-Platform Value Stream, Machine-Learning-Team, Research: Proposal: deprecate the mediawiki.revision-score stream in favour of more streams like mediawiki-revision-score- - https://phabricator.wikimedia.org/T317768 (EChetty)
[14:25:32] Data-Engineering-Planning, Shared-Data-Infrastructure: RAID battery alert in an-worker1083 - https://phabricator.wikimedia.org/T321809 (EChetty) a:Stevemunene
[14:25:49] Data-Engineering-Planning, Shared-Data-Infrastructure: an-worker1090 MegaRaid issues - https://phabricator.wikimedia.org/T315748 (EChetty) a:Stevemunene
[14:26:14] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): RAID battery alert in an-worker1083 - https://phabricator.wikimedia.org/T321809 (EChetty)
[14:26:18] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): an-worker1090 MegaRaid issues - https://phabricator.wikimedia.org/T315748 (EChetty)
[14:27:16] Data-Engineering, Data-Engineering-Kanban, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Fix turnilo after upgrade - https://phabricator.wikimedia.org/T308778 (EChetty) a:BTullis→Stevemunene
[15:24:27] Analytics-Jupyter, Data-Engineering-Planning, Product-Analytics, Data Pipelines (Sprint 04), Patch-For-Review: Add support for jupyterhub on conda-analytics - https://phabricator.wikimedia.org/T321088 (xcollazo) @BTullis for whenever you have some time, please look at T321088#8375212. If you...
[15:38:22] (CR) Snwachukwu: [WIP] Add Custom Authentication Configuration Class for Cassandra. (8 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/851077 (https://phabricator.wikimedia.org/T306895) (owner: Snwachukwu)
[16:11:50] Analytics-Jupyter, Data-Engineering-Planning, Product-Analytics, Data Pipelines (Sprint 04), Patch-For-Review: Add support for jupyterhub on conda-analytics - https://phabricator.wikimedia.org/T321088 (BTullis) I managed to fix this on an-test-client1001 by running the following manually. ` a...
[16:44:58] (PS2) Snwachukwu: Add Custom Authentication Configuration Class for Cassandra. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/851077 (https://phabricator.wikimedia.org/T306895)
[17:09:53] Data-Engineering-Planning, Data Pipelines (Sprint 04): Allow Cormac Parle and Marco Fossati to deploy analytics-platform-eng Airflow instance - https://phabricator.wikimedia.org/T321925 (EChetty) p:Triage→High
[17:23:59] Analytics-Radar, Infrastructure-Foundations, netops, Patch-For-Review: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (fgiunchedi) I have deployed the bandaid above, so ifup failures will be reset (only once) on Ganeti VMs two minutes after boo...
[17:25:08] Analytics-Radar, Infrastructure-Foundations, netops, Patch-For-Review: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (fgiunchedi) Open→Resolved a:fgiunchedi I'll optimistically resolve the task, though of course reopen if sth is am...
[18:36:48] heya joal, you still on? if so, do you want to brainbounce on notebooks in Airflow? I have a couple ideas I'd like to confirm :] Otherwise, tomorrow is good too!
[18:57:25] (PS1) DCausse: [WIP] cirrussearch: add fetch_failure schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/854572 (https://phabricator.wikimedia.org/T317609)
[19:02:08] mforns: heya - sorry I missed the ping - I'll have time in about 1h - does that work for you?
[19:52:55] ping mforns - are you around?
[19:53:20] heya joal!
[19:53:27] yes, bc?
[19:54:13] OMW!
[20:30:21] Data-Engineering, Data Pipelines: Add support for repository artifacts in Airflow - https://phabricator.wikimedia.org/T322690 (mforns)