[04:22:39] PROBLEM - Check unit status of monitor_refine_event_sanitized_analytics_delayed on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event_sanitized_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:26:21] PROBLEM - Check unit status of monitor_refine_event_sanitized_analytics_immediate on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event_sanitized_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:18:51] good morning folks
[06:18:52] Sep 22 23:15:01 an-launcher1002 kerberos-run-command[21386]: Error: A JNI error has occurred, please check your installation and try again
[06:18:55] Sep 22 23:15:01 an-launcher1002 kerberos-run-command[21386]: Exception in thread "main" java.lang.NoClassDefFoundError: scala/Function2
[06:24:11] (this was the hdfs-cleaner script)
[07:02:10] super weird elukey :(
[07:02:37] bonjour :(
[07:03:02] team: I'm not well today, I have some food-poisoning symptoms - I'm gonna stay in bed mostly
[07:03:30] ouch, get better joal
[10:49:01] hello folks, we had an issue with the puppet compiler in the past couple of days, since SRE is now filtering puppetdb's facts for performance reasons (pcc uses puppetdb)
[10:49:32] I asked John this morning to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/723141/ otherwise most of analytics code would fail with pcc
[10:50:07] I don't recall other usages of facts that may be problematic, but please double check as well and/or report if you see a strange pcc result :)
[12:20:54] Analytics-Radar, Privacy Engineering, WMDE-Analytics-Engineering, Wikidata, Wikidata Analytics: Privacy Policy Review for Global South Wikidata edits and active editors datasets - https://phabricator.wikimedia.org/T291186 (Manuel) Hi @Htriedman! Thank you for your answer! > Who exactly are...
[13:36:29] Analytics, Analytics-Kanban: Test snapshot-reload from all instances using pageview-top data table - https://phabricator.wikimedia.org/T291473 (JAllemandou) Testing done - top table doesn't present any missing row \o/. We can safely proceed with the plan: - snapshot+transfer+load small tables (except pa...
[13:36:40] Analytics, Analytics-Kanban: Test snapshot-reload from all instances using pageview-top data table - https://phabricator.wikimedia.org/T291473 (JAllemandou)
[13:36:51] Analytics, Analytics-Kanban: Test snapshot-reload from all instances using pageview-top data table - https://phabricator.wikimedia.org/T291473 (JAllemandou) In progress→Resolved
[13:36:53] Analytics, Analytics-Kanban: Check AQS with cassandra (serving + data) - https://phabricator.wikimedia.org/T290068 (JAllemandou)
[13:37:15] For when you're back btullis: test of top table is good - we can proceed with the plan :)
[13:48:01] (CR) Ottomata: [C: +1] Migrate MobileWikiAppDailyStats to MEP Bug: T286000 Change-Id: Ibf65637926bb3be80c64b2c373958a52b034aedb (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/722964 (https://phabricator.wikimedia.org/T286000) (owner: Sharvaniharan)
[13:54:12] (CR) Ottomata: [C: +1] Migrate MobileWikiAppDailyStats to MEP Bug: T286000 Change-Id: Ibf65637926bb3be80c64b2c373958a52b034aedb (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/722964 (https://phabricator.wikimedia.org/T286000) (owner: Sharvaniharan)
[14:15:33] Analytics, Analytics-Kanban, Traffic: Review use of realloc in varnishkafka - https://phabricator.wikimedia.org/T287561 (odimitrijevic) p:Triage→Low
[14:40:59] Analytics, Analytics-Kanban, Event-Platform, Metrics-Platform, Patch-For-Review: wgEventStreams (EventStreamConfig) should support per wiki overrides - https://phabricator.wikimedia.org/T277193 (Ottomata) a:Ottomata
[15:03:45] joal: Excellent!
I'll start adapting my scripts to take 12 snapshots, transfer them, move them, and reload them.
[15:05:26] Analytics, Analytics-Kanban: Repair and reload cassandra2 mediarequest_per_file data table - https://phabricator.wikimedia.org/T291470 (BTullis) Repair progress is now 70% of snapshot 2 of 4. ` [2021-09-23 15:02:32,583] Repair session 6bd71c70-1b86-11ec-8d9d-cbcce8d668d2 for range (-5890521129570253646,-...
[15:20:34] Analytics, Analytics-Kanban: Snapshot and Reload cassandra2 pageview_per_file data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) Open→In progress
[15:20:36] Analytics, Analytics-Kanban: Test snapshot-reload from all instances using pageview-top data table - https://phabricator.wikimedia.org/T291473 (BTullis)
[15:20:38] Analytics, Analytics-Kanban: Check AQS with cassandra (serving + data) - https://phabricator.wikimedia.org/T290068 (BTullis)
[15:42:08] Analytics, Analytics-Kanban: Repair and reload all cassandra-2 data tables but the 2 big ones - https://phabricator.wikimedia.org/T291469 (BTullis) Now that the repair has completed, we can carry out the next steps. Building upon the experience from {T291473} I will use scripts as much as possible in ord...
[15:57:45] Analytics, Event-Platform, SRE, Wikimedia-Logstash, and 2 others: Integrate Event Platform and ECS logs - https://phabricator.wikimedia.org/T291645 (Ottomata)
[16:13:59] folks I am rebooting an-worker1096 to check why a new disk doesn't show up correctly
[16:14:13] there were no containers running so in theory no impact
[16:29:07] Analytics, Data-Engineering: Analytics-hadoop Spark3 package upgrade (production) - https://phabricator.wikimedia.org/T291466 (odimitrijevic)
[16:29:27] Analytics, Data-Engineering: Analytics-test-hadoop Spark3 package upgrade - https://phabricator.wikimedia.org/T291465 (odimitrijevic)
[16:33:40] Analytics, Data-Engineering: Upgrade analytics-hadoop to Spark 3 + scala 2.12 - https://phabricator.wikimedia.org/T291464 (odimitrijevic)
[16:37:09] Analytics, Platform Engineering, Wikibase change dispatching scripts to jobs, serviceops: Better observability/visualization for jobs - https://phabricator.wikimedia.org/T291620 (Michael) Adding #platform_engineering #serviceops and #analytics as this is related to all three teams. I'm aware that...
[16:37:31] Analytics, SRE, ops-eqiad: Degraded RAID on an-worker1096 - https://phabricator.wikimedia.org/T290805 (elukey) New disk up and running, I added some more info to https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Swapping_broken_disk (in this case there was no unconfi...
[16:37:40] ok an-worker1096 has the new disk
[16:38:01] https://phabricator.wikimedia.org/T290805 was closed by DCOps and we didn't follow up :(
[16:38:08] elukey: Thanks ^
[16:40:50] np :)
[16:41:17] btullis, razzi - any timeline for https://phabricator.wikimedia.org/T288625 ?
I can help if you are busy
[16:41:31] (it should be mostly running cookbooks)
[16:42:44] I'm not working on Java security updates myself - could use help
[16:43:06] Focusing on quarterly goals for the next week, superset / presto improvements
[16:44:56] sure sure, but the cookbooks can run in parallel with low priority, I think that it is fine to run them and do something else (checking every now and then)
[16:45:17] they are pretty safe by now, we ran them a lot of times :)
[16:48:11] My brain is single threaded 😛 but I can try to pick off some of them
[16:48:44] or should i say, my train of thought is single tracked (brain is GPU+-parallel)
[16:52:42] sure not saying "all in one day", but a couple per day, something like that :)
[16:53:53] I'll pick some restarts up to help out during the next days
[16:54:06] Me too.
[17:17:09] We can do it elukey. You've got enough on your plate :-)
[17:25:31] mforns: you wanna now?
[17:25:36] ottomata: sure!
[17:25:41] bc?
[17:25:43] back to bc
[17:31:23] Analytics, Analytics-Kanban: Repair and reload all cassandra-2 data tables but the 2 big ones - https://phabricator.wikimedia.org/T291469 (BTullis) **Snapshot command:** We need snapshots across 10 keyspaces on 4 instances ` sudo cumin 'aqs100[4,7].eqiad.wmnet' --mode async 'nodetool-a snapshot -t T2914...
[18:12:15] Analytics-Clusters, DC-Ops, Data-Engineering, SRE, ops-eqiad: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (Cmjohnson)
[18:12:32] Analytics-Clusters, DC-Ops, Data-Engineering, SRE, ops-eqiad: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (Cmjohnson) updated dns and network
[18:52:05] Analytics, Event-Platform, serviceops: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (Ottomata) We tried to deploy this today, but ran into an issue: Since the k8s resources have been renamed, k8s thinks t...
[18:53:40] Analytics, Event-Platform, serviceops: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (Pchelolo) > To deploy, we are going to have to depool a DC, delete the existing deployment, apply the new one, then repo...
[18:55:19] Analytics, Platform Engineering, Wikibase change dispatching scripts to jobs, serviceops: Better observability/visualization for jobs - https://phabricator.wikimedia.org/T291620 (Ottomata) Data Eng (analytics) is in the process of [[ https://phabricator.wikimedia.org/T282033 | solving on a simila...
[19:11:06] (CR) Sharvaniharan: Migrate MobileWikiAppDailyStats to MEP Bug: T286000 Change-Id: Ibf65637926bb3be80c64b2c373958a52b034aedb (4 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/722964 (https://phabricator.wikimedia.org/T286000) (owner: Sharvaniharan)
[20:02:24] Analytics, Data-Engineering: Set hive.warehouse.subdir.inherit.perms to false - https://phabricator.wikimedia.org/T291664 (Ottomata)
[20:05:21] Analytics, Event-Platform, serviceops: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (Ottomata) Oof right. I've already merged the eventgate chart change, and I think to rollback we'd have to revert and th...
[20:06:39] Analytics, Event-Platform, serviceops: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (Pchelolo) why rollback? we just make the same changes to eventstreams before going through the deployment
[20:15:41] (CR) Ottomata: [C: +1] Migrate MobileWikiAppDailyStats to MEP Bug: T286000 Change-Id: Ibf65637926bb3be80c64b2c373958a52b034aedb (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/722964 (https://phabricator.wikimedia.org/T286000) (owner: Sharvaniharan)
[20:16:58] Analytics, Event-Platform, serviceops: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (Ottomata) I'm worried that in the meantime someone will need to make an emergency fix/change to eventgate and won't be a...
[20:21:47] Analytics, Event-Platform, serviceops: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (Pchelolo) oh, yeah. ok. up to you.
[20:29:16] (CR) Bstorm: "On trying this, it doesn't display query results any more. I'm trying to determine for sure if that is my environment or not." [analytics/quarry/web] - https://gerrit.wikimedia.org/r/719567 (https://phabricator.wikimedia.org/T289349) (owner: Michael DiPietro)
[20:38:53] Analytics, Data-Engineering: SPIKE - Will Hadoop 3 container support help us for Airflow deployment pipelines? - https://phabricator.wikimedia.org/T288247 (Ottomata) Also > Docker Container Executor runs in non-secure mode of HDFS and YARN. It will not run in secure mode, and will exit if it detects sec...
[20:39:16] Analytics, Analytics-Kanban, Data-Engineering: SPIKE - Will Hadoop 3 container support help us for Airflow deployment pipelines? - https://phabricator.wikimedia.org/T288247 (Ottomata)
[20:42:28] (CR) Michael DiPietro: add stop status (1 comment) [analytics/quarry/web] - https://gerrit.wikimedia.org/r/719567 (https://phabricator.wikimedia.org/T289349) (owner: Michael DiPietro)