[06:44:05] (PS16) Joal: Update refine to use Iceberg for event_sanitize [analytics/refinery/source] - https://gerrit.wikimedia.org/r/811212 (https://phabricator.wikimedia.org/T311739)
[13:02:24] mforns: Good afternoon! Let me know when you have a minute to talk about the cassandra loading
[13:02:42] heya! we can talk now if you wish!
[13:02:52] \o/
[13:02:54] batcave?
[13:03:06] actually, batcave in a minute?
[13:03:07] omw!
[13:05:16] Couldn't hear you mforns
[13:05:17] :(
[13:19:48] ottomata: Good morning! Let me know when you have a minute, there is a question I'd like to run by you
[13:48:24] ping ottomata?
[13:51:41] must be in meeting :)
[13:55:12] Data-Engineering-Operations, SRE, SRE-Access-Requests: Requesting Kerberos access for alinebruenger and siko - https://phabricator.wikimedia.org/T316766 (Aline_Bruenger_WMDE) Open→Resolved a:Aline_Bruenger_WMDE Thanks a lot!
[13:57:32] joal: was in meeting
[13:57:37] i have 3 minutes before next meeting
[13:57:38] hello!
[13:57:45] ottomata: o/ I assumed so :)
[13:57:55] ottomata: I guess we'll do after meetings :)
[13:58:09] okay should be out at next hour:30
[13:58:09] ottomata: and sorry for the double ping
[13:58:12] no prob!
[13:58:21] i'm getting worse at seeing IRC pings these days :(
[13:58:25] please ping more
[13:58:32] :)
[14:20:37] PROBLEM - SSH on analytics1077.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:34:43] joal: 5 mins... :)
[14:41:54] ottomata: sorry I was away
[14:41:59] ottomata: good now?
[14:44:59] joally
[14:45:00] yes!
[14:45:04] joal: bc?
[14:45:08] OMW!
[15:27:31] mforns: I just sent https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/144
[15:27:38] mforns: would you mind checking it?
[15:57:12] mforns: quick question - How can I make an artifact sync to HDFS from a definition in airflow dag config?
[15:57:32] Or more precisely, how can I check that it will work based on names etc
[16:23:11] RECOVERY - SSH on analytics1077.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:44:33] so what are the coordinates of this 'batcave' you all speak of? :)
[16:49:04] oh joal, missed your ping!
[16:49:57] np mforns - will ping again after meeting :)
[16:51:32] ok
[16:53:13] Ready I am mforns :)
[16:53:18] mforns: batcave?
[16:53:30] yep joal :]
[17:28:17] (CR) Kosta Harlan: [C: +2] analytics/mediawiki/accountcreation/block: Re-add required flags [schemas/event/secondary] - https://gerrit.wikimedia.org/r/833857 (https://phabricator.wikimedia.org/T317343) (owner: Gergő Tisza)
[17:28:53] (Merged) jenkins-bot: analytics/mediawiki/accountcreation/block: Re-add required flags [schemas/event/secondary] - https://gerrit.wikimedia.org/r/833857 (https://phabricator.wikimedia.org/T317343) (owner: Gergő Tisza)
[17:29:57] ok mforns - https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/144 is ready to review, the artifact has been tested and all
[17:30:23] mforns: with your permission, and if the code is ok, I'll merge/deploy this when I get back later tonight
[17:30:57] In the meantime, gone for evening!
[17:32:25] joal, sure go ahead!
[17:43:26] Data-Engineering-Kanban, Data Engineering Planning, Data Pipelines (Sprint 01), Patch-For-Review: Build and install spark3 assembly - https://phabricator.wikimedia.org/T310578 (Ottomata) TODO: are we sure we want to call this a '.jar' file? A jar is a zip, but i wouldn't expect a .jar file to co...
[17:51:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:55:03] PROBLEM - SSH on analytics1076.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:56:03] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:00:35] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:06:13] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:56:05] RECOVERY - SSH on analytics1076.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:03:27] Analytics-Clusters, Analytics-Kanban, Data-Engineering, Patch-For-Review: Deploy an-test-coord1002 to facilitate failover testing of analytics coordinator role - https://phabricator.wikimedia.org/T287864 (odimitrijevic)
[19:03:29] Data-Engineering, Data-Engineering-Kanban, Epic: Operational Excellence - Q2 21/22 - https://phabricator.wikimedia.org/T288250 (odimitrijevic)
[19:04:14] Analytics-Jupyter, Data-Engineering: Autocomplete is very slow (unusable) in Newpyter - https://phabricator.wikimedia.org/T290008 (odimitrijevic)
[19:04:16] Data-Engineering, Data-Engineering-Kanban, Epic: Operational Excellence - Q2 21/22 - https://phabricator.wikimedia.org/T288250 (odimitrijevic)
[19:04:43] Data-Engineering, Data-Engineering-Kanban, Epic: Operational Excellence - Q2 21/22 - https://phabricator.wikimedia.org/T288250 (odimitrijevic) Open→Resolved a:odimitrijevic
[19:08:36] Any idea if the X-Analytics header makes it to the mw application servers? I'm interested in using the public_cloud=1 marker to put a class of search requests into a separate bucket that gets less concurrent requests allowed to ensure we preserve capacity for humans
[19:13:03] (CR) Xcollazo: "Follow up for the CREATE statement moves. There will be one more patch set after this one, making this a legit saga 😄." [analytics/refinery] - https://gerrit.wikimedia.org/r/833822 (owner: Xcollazo)
[19:35:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp5011 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp5011%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[19:40:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp5011 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp5011%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[20:22:45] Data-Engineering, API Platform, Platform Engineering Roadmap, User-Eevans: AQS 2.0:Wikistats 2 service - https://phabricator.wikimedia.org/T288301 (Milimetric) @BPirkle: that's well summarized, I think you captured the messiness of this in ways that I didn't see as we were building it. I think t...
[20:23:03] Data-Engineering, Event-Platform Value Stream (Sprint 01), Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (Milimetric) (sorry, submitted too soon, still editing, don't read :))
[20:50:13] Data-Engineering, API Platform, Platform Engineering Roadmap, User-Eevans: AQS 2.0:Wikistats 2 service - https://phabricator.wikimedia.org/T288301 (Milimetric) @BPirkle: that's well summarized, I think you captured the messiness of this in ways that I didn't see as we were building it. I think t...
[20:53:47] !log Deploy analytics airflow-dags to try to fix cassandra loading jobs
[20:53:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:25:17] (PS1) Joal: Remove cassandra-connector from refinery-job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/834382
[21:25:52] (CR) Joal: [C: +2] "Self merging to unlock cassandra loading" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/834382 (owner: Joal)
[21:26:17] (CR) Joal: [V: +2 C: +2] Remove cassandra-connector from refinery-job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/834382 (owner: Joal)
[21:34:33] (Merged) jenkins-bot: Remove cassandra-connector from refinery-job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/834382 (owner: Joal)
[21:43:51] (PS1) Joal: Bump changelog to 0.2.7 for release [analytics/refinery/source] - https://gerrit.wikimedia.org/r/834387
[21:44:11] (CR) Joal: [V: +2 C: +2] "Self merging for deploy" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/834387 (owner: Joal)
[21:44:39] Starting build #112 for job analytics-refinery-maven-release-docker
[21:56:12] (VarnishkafkaNoMessages) firing: (2) varnishkafka on cp5015 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[21:57:02] Project analytics-refinery-maven-release-docker build #112: SUCCESS in 12 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/112/
[22:00:17] PROBLEM - SSH on analytics1076.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:01:12] (VarnishkafkaNoMessages) resolved: (2) varnishkafka on cp5015 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[22:20:06] !log Deploy airflow for cassandra-loading patch
[22:20:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:26:37] !log Kill oozie cassandra monthly loading jobs as we migrate them to airflow
[22:26:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:29:20] Ok done for tonight :)
[23:01:35] RECOVERY - SSH on analytics1076.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook