[01:35:27] (SystemdUnitFailed) firing: (9) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:16:43] (DiskSpace) firing: Disk space an-test-worker1002:9100:/ 0.05584% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[02:40:12] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:05:12] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:15:12] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:16:43] (DiskSpace) firing: Disk space an-test-worker1002:9100:/ 0.0001886% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[06:38:58] Data-Engineering, DBA, Data-Services: Prepare and check storage layer for fatwiki - https://phabricator.wikimedia.org/T335018 (Marostegui) Database `_p` created and grants created. This is ready for views creation.
[06:39:28] Data-Engineering, DBA, Data-Services: Prepare and check storage layer for kcgwiktionary - https://phabricator.wikimedia.org/T334739 (Marostegui) Database `_p` created and grants created. This is ready for views creation.
[06:39:53] Data-Engineering, DBA, Data-Services: Prepare and check storage layer for guwwikinews - https://phabricator.wikimedia.org/T334408 (Marostegui) Database `_p` created and grants created. This is ready for views creation.
[06:40:16] Data-Engineering, DBA, Data-Services: Prepare and check storage layer for kbdwiktionary - https://phabricator.wikimedia.org/T333270 (Marostegui)
[08:15:12] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:43:32] (GobblinLastSuccessfulRunTooLongAgo) firing: (2) Last successful gobblin run of job event_default_test was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[08:48:32] (GobblinLastSuccessfulRunTooLongAgo) firing: (3) Last successful gobblin run of job event_default_test was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[08:57:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[09:12:29] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[10:16:43] (DiskSpace) firing: Disk space an-test-worker1002:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[12:07:30] Hi btullis - would you have a minute for me?
[12:15:12] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:41:35] joal: patch looks good, looks like we need a kerberos keytab for an-web1001
[12:41:36] https://puppet-compiler.wmflabs.org/output/910761/40829/an-web1001.eqiad.wmnet/change.an-web1001.eqiad.wmnet.err
[12:41:38] will make one...
[12:41:51] ottomata: indeed - I didn't know how to do it
[12:42:13] https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Kerberos#Create_a_keytab_for_a_service
[12:42:20] thank you
[12:42:28] joal: actually, to make PCC work, we need a dummy keytab in 'labs-private'
[12:42:30] can you make that one
[12:42:32] ?
[12:42:41] currently in meeting ottomata :S
[12:43:23] e.g. https://gerrit.wikimedia.org/r/plugins/gitiles/labs/private/+/refs/heads/master/modules/secret/secrets/kerberos/keytabs/an-test-client1001.eqiad.wmnet/analytics/analytics.keytab
[12:43:25] okay can do
[12:48:33] (GobblinLastSuccessfulRunTooLongAgo) firing: (3) Last successful gobblin run of job event_default_test was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[12:51:43] (DiskSpace) resolved: Disk space an-test-worker1002:9100:/ 0.02054% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[13:03:32] (GobblinLastSuccessfulRunTooLongAgo) resolved: (3) Last successful gobblin run of job event_default_test was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[13:04:14] joal: i don't understand how the current hdfs rsync job on dumps server works with user = 'dumpsgen'
[13:04:40] ottomata: I think there is a keytab for them
[13:04:46] oh really...?
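(An aside on the dummy-keytab step discussed above: the wikitech page linked at 12:42:13 has the real procedure; purely as an illustrative sketch, a placeholder keytab for PCC could be produced with MIT Kerberos's ktutil as below. The principal name, kvno, enctype, and WIKIMEDIA realm are assumptions here, not the documented WMF steps.)

    $ ktutil
    ktutil:  addent -password -p analytics/an-web1001.eqiad.wmnet@WIKIMEDIA -k 1 -e aes256-cts
    Password for analytics/an-web1001.eqiad.wmnet@WIKIMEDIA:    (any throwaway value; this file is only a placeholder)
    ktutil:  wkt analytics.keytab
    ktutil:  quit

(The resulting file would then be committed to labs/private under a path mirroring the production secret, presumably modules/secret/secrets/kerberos/keytabs/an-web1001.eqiad.wmnet/analytics/analytics.keytab, by analogy with the an-test-client1001 example linked at 12:43:23.)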
[13:04:46] nm
[13:04:46] and the data is readable by all
[13:04:48] hm
[13:05:01] well well there sure is
[13:05:12] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:05:13] okay i guess i need a keytab for stats user, not analytics :/
[13:06:36] !log killed the gobblin-eventlogging_legacy_test on an-test-coord1001
[13:06:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:07:52] !log restarted the gobblin-eventlogging_legacy_test on an-test-coord1001
[13:07:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:10:53] Thank you ottomata for the keytab creation
[13:11:35] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 12), Patch-For-Review: Refactor parameterization of eventutilities-python and mediawiki-event-enrichment - https://phabricator.wikimedia.org/T328478 (JArguello-WMF)
[13:13:29] joal still working on it, in meetings now
[13:13:43] np ottomata - thank you :)
[13:46:03] Data-Engineering, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 12): an-test-worker1002 is constantly writing to /tmp - https://phabricator.wikimedia.org/T335358 (BTullis)
[13:46:32] Data-Engineering, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 12): an-test-worker1002 is constantly writing to /tmp - https://phabricator.wikimedia.org/T335358 (BTullis)
[13:47:31] !log rebooting an-test-worker1002 T335358
[13:47:32] Data-Engineering, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 12): an-test-worker1002 is constantly writing to /tmp - https://phabricator.wikimedia.org/T335358 (ops-monitoring-bot) Host rebooted by btullis@cumin1001 with reason: Investigating excessive writing to /tmp
[13:47:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:47:34] T335358: an-test-worker1002 is constantly writing to /tmp - https://phabricator.wikimedia.org/T335358
[13:47:50] Data-Engineering, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 12): an-test-worker1002 is constantly writing to /tmp - https://phabricator.wikimedia.org/T335358 (BTullis) p:Triage→Medium
[13:50:47] Data-Engineering, Data-Persistence, IP Masking: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (Tchanders) Thanks @Ladsgroup . I'd be happy to go with this, but before we do, I'd like to hear from @tstarling and/or @daniel first, since they originally decided agains...
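(A side note on the T335358 "constantly writing to /tmp" investigation above: the task itself shows only a reboot; the probes below are standard Linux tooling one might use to find the writer, not commands taken from the task. fatrace is assumed to be installed.)

    # Largest entries under /tmp
    sudo du -xah /tmp | sort -rh | head -n 20
    # Processes currently holding files open under /tmp
    sudo lsof +D /tmp
    # Live write events, filtered to /tmp
    sudo fatrace -f W | grep /tmp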
[13:59:56] (PS1) Btullis: Add guw.wikinews and kbd.wiktionary to the allowlist [analytics/refinery] - https://gerrit.wikimedia.org/r/911869 (https://phabricator.wikimedia.org/T334459)
[14:06:15] Data-Engineering, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 12): an-test-worker1002 is constantly writing to /tmp - https://phabricator.wikimedia.org/T335358 (JArguello-WMF)
[14:08:05] (CR) Joal: [C: -1] "Separators issue" [analytics/refinery] - https://gerrit.wikimedia.org/r/911869 (https://phabricator.wikimedia.org/T334459) (owner: Btullis)
[14:15:42] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 12): Deploy ceph radosgw processes to data-engineering cluster - https://phabricator.wikimedia.org/T330152 (JArguello-WMF)
[14:17:05] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 12): Event Driven Enrichment Pipelines repositories should be generated from a template - https://phabricator.wikimedia.org/T324980 (Ottomata)
[14:20:31] Data-Engineering-Planning, Data Pipelines, Shared-Data-Infrastructure: [Iceberg] Debianize and install iceberg support for Spark, Presto, and optionally Hive - https://phabricator.wikimedia.org/T311738 (JArguello-WMF)
[14:22:23] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 12): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (JArguello-WMF)
[14:23:00] Data-Engineering-Planning, Infrastructure-Foundations, Shared-Data-Infrastructure: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (JArguello-WMF)
[14:25:57] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 12): Deploy ceph radosgw processes to data-engineering cluster - https://phabricator.wikimedia.org/T330152 (JArguello-WMF) p:Triage→High
[14:31:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[14:36:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[14:44:13] Data-Engineering-Planning, Data Pipelines, Shared-Data-Infrastructure: [Iceberg] Debianize and install iceberg support for Spark, Presto, and optionally Hive - https://phabricator.wikimedia.org/T311738 (JArguello-WMF)
[14:46:11] (CR) Aqu: "Looks good. 2 non blocking comments." [analytics/refinery] - https://gerrit.wikimedia.org/r/910092 (https://phabricator.wikimedia.org/T334096) (owner: Mforns)
[14:47:12] Data-Engineering-Planning, Data Pipelines, Shared-Data-Infrastructure: [Iceberg] Debianize and install iceberg support for Spark, Presto, and optionally Hive - https://phabricator.wikimedia.org/T311738 (JArguello-WMF)
[14:47:21] Data-Engineering, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 12): an-test-worker1002 is constantly writing to /tmp - https://phabricator.wikimedia.org/T335358 (BTullis) Open→Resolved Upon reboot, this behaviour seems to have stopped happening and the host is back to normal. {F36962796,w...
[14:48:15] ottomata: I see you've merged the patch for synchronization of HDFS to an-web
[14:48:22] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:48:26] ottomata: Thank you for that :)
[14:48:48] ottomata: woops - actually we have an error as I write :S
[14:49:04] ottomata: I wondered about file group-ownership
[14:49:35] I'm fuelling up the analytics deployment train at the moment. I see four commits to refinery to be deployed: https://etherpad.wikimedia.org/p/analytics-weekly-train
[14:50:04] Anything else for anyone?
[14:50:12] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:50:57] (CR) Btullis: Add guw.wikinews and kbd.wiktionary to the allowlist (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/911869 (https://phabricator.wikimedia.org/T334459) (owner: Btullis)
[14:51:48] (CR) Joal: [C: -1] Add guw.wikinews and kbd.wiktionary to the allowlist (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/911869 (https://phabricator.wikimedia.org/T334459) (owner: Btullis)
[14:53:30] (PS2) Btullis: Add guw.wikinews and kbd.wiktionary to the allowlist [analytics/refinery] - https://gerrit.wikimedia.org/r/911869 (https://phabricator.wikimedia.org/T334459)
[14:54:59] (CR) Joal: Add guw.wikinews and kbd.wiktionary to the allowlist (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/911869 (https://phabricator.wikimedia.org/T334459) (owner: Btullis)
[14:55:38] Running for errand, will be back at standup time
[14:56:02] (PS3) Btullis: Add guw.wikinews and kbd.wiktionary to the allowlist [analytics/refinery] - https://gerrit.wikimedia.org/r/911869 (https://phabricator.wikimedia.org/T334459)
[15:01:03] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:05:12] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:05:45] (CR) Ottomata: [V: +2 C: +2] add an-web1001 to list of targets [analytics/hdfs-tools/deploy] - https://gerrit.wikimedia.org/r/911788 (owner: Ottomata)
[15:27:35] btullis, I've added this puppet patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/908777/ to the train in Etherpad. Thanks !
[15:45:35] aqu: Many thanks.
[15:46:14] (PS1) Mforns: Migrate queries for webrequest_sampled_128 to /hql (Airflow/Spark3) [analytics/refinery] - https://gerrit.wikimedia.org/r/911890 (https://phabricator.wikimedia.org/T334106)
[15:49:57] (PS2) Mforns: Migrate queries for webrequest_sampled_128 to /hql (Airflow/Spark3) [analytics/refinery] - https://gerrit.wikimedia.org/r/911890 (https://phabricator.wikimedia.org/T334106)
[15:51:13] (PS3) Mforns: Migrate queries for webrequest_sampled_128 to /hql (Airflow/Spark3) [analytics/refinery] - https://gerrit.wikimedia.org/r/911890 (https://phabricator.wikimedia.org/T334106)
[15:55:54] (PS3) Mforns: Migrate unique devices druid loading queries to Airflow/SparkSQL [analytics/refinery] - https://gerrit.wikimedia.org/r/910092 (https://phabricator.wikimedia.org/T334096)
[15:56:40] (CR) Mforns: "Thank you for the review @aqu! Made the suggested changes" [analytics/refinery] - https://gerrit.wikimedia.org/r/910092 (https://phabricator.wikimedia.org/T334096) (owner: Mforns)
[16:15:36] ottomata: I'm sorry I forgot that bit about deploying hdfs_tools onto web
[16:16:07] ottomata: the hdfs_rsync had worked when I checked though
[16:16:29] ottomata: I don't understand why hardsync failed though
[16:54:47] hardsync failed? looking
[17:03:06] joal: it works for me, everything looks fine.
[17:03:06] except
[17:03:11] i did
[17:03:25] echo 'hello' | hdfs dfs -put - /wmf/data/published/datasets/tmp1.txt
[17:03:37] hdfs-rsync failed because the default file perms were not readable
[17:04:21] hm, must be because of the default umask for HDFS I assume
[17:05:03] ottomata: we had an error earlier on about hardsync failure - must have succeeded after (and failed originally for a reason I didn't understand)
[17:05:12] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:05:16] I didn't get around to the deployment train today after all. Are we OK if I do it tomorrow morning UK time?
[17:06:04] ottomata: you have removed your file, right?
[17:08:55] ya i removed it
[17:09:06] mwarf
[17:10:38] i mean it's not so bad, we can just ask folks to chmod o+r their files when they put them there?
[17:10:45] or, make a cron that just does that?
[17:10:49] That's what I am pondering
[17:11:05] Is it worth having a job doing this?
[17:11:21] The downside is that people forgetting to do it breaks hdfs_rsync
[17:11:24] :(
[17:11:26] true.
[17:11:54] Could we make the `stats` user part of analytics-private-data?
[17:12:01] Posix ACLs ?
[17:12:04] Or make that folder use a different group
[17:14:54] I know I've mentioned it before, but this sounds like the ideal use case. `hdfs dfs -setfacl -m default:user: /wmf/data/published/datasets`
[17:19:02] why not btullis!
[17:19:16] ottomata: any reason not to do it with ACLs?
[17:22:11] joal: I think we could hdfs-rsync with the analytics user? hmmm
[17:22:48] ottomata: why not!
[17:22:57] dunno if that would break hardsync...
[17:22:58] any solution works for me :)
[17:23:00] i guess not?
[17:23:19] btullis: acls could work...but we have to manage them
[17:23:28] Data would be group readable, but all-readable on the local host I assume
[17:23:45] i think so...
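(The -setfacl command quoted at 17:14:54 appears truncated. For illustration only, a default ACL granting a read-only user on newly created content could look like the sketch below; the 'stats' user is borrowed from the earlier keytab discussion and the permission bits are examples, not the agreed fix. HDFS ACLs also require dfs.namenode.acls.enabled=true on the NameNode.)

    # Default ACL entry: inherited by files/dirs created under the path from now on
    hdfs dfs -setfacl -m default:user:stats:r-x /wmf/data/published/datasets
    # Access ACL entry: applied to content that already exists
    hdfs dfs -setfacl -R -m user:stats:r-x /wmf/data/published/datasets
    # Verify the resulting ACLs
    hdfs dfs -getfacl /wmf/data/published/datasets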
[17:24:45] Not all-readable sorry
[17:25:24] hm, actually I could make hdfs-rsync force the perms to be all-readable on the local machine IIRC
[17:26:58] actually this bit is already done
[17:27:06] does it currently copy perms from source?
[17:27:19] So if the analytics user pulls from HDFS, we should be fine on the web host
[17:27:57] ottomata: see https://github.com/wikimedia/operations-puppet/blob/production/modules/hdfs_tools/manifests/hdfs_rsync_job.pp#L67
[17:28:08] We do: --perms --chmod D755,F644
[17:28:08] D755: Add visual error handling to dashboard and detail - https://phabricator.wikimedia.org/D755
[17:28:25] So we force perms to be 755 for dirs and 644 for files
[17:28:45] (CR) Joal: [V: +2 C: +2] "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/911869 (https://phabricator.wikimedia.org/T334459) (owner: Btullis)
[17:29:15] oh!
[17:29:16] okay
[17:29:47] so the an-web local data is normally ok, and pulling it as the analytics user should do the trick
[17:31:08] hmm joal i think it is okay to hdfs-rsync with delete
[17:31:16] wait...is it?
[17:31:50] ottomata: I assume it is, it'll just mean data will be deleted from the local published-rsynced folder, but kept in the hardsync one
[17:32:02] So still published
[17:32:05] yes right.
[17:32:16] we do --delete in the regular rsync commands
[17:32:32] Ok, let's do it for hdfs-rsync then
[17:32:42] k
[17:32:46] am making a patch
[17:32:48] will include it
[17:32:52] I assume having the --delete makes it easier to manually delete (only one place instead of 2)
[17:32:59] thanks a lot ottomata <3
[17:33:11] we should rename the systemd timer too, it's confusing right now
[17:33:15] will take some manual cleanup, will do
[17:33:41] Oh true! my bad ottomata - I'm sorry for that
[17:34:36] i missed it too
[17:40:42] https://gerrit.wikimedia.org/r/c/operations/puppet/+/911913/
[17:44:14] ottomata: commented
[17:52:36] oof this cleanup will be more complicated than I thought, making patch to ensure absent the old ones...
[18:04:21] joal it works!
[18:04:32] \o/
[18:04:39] i had to chown the an-web dir to analytics:root to allow the hdfs_rsync command to write there
[18:04:43] but it works now!
[18:05:09] Awesome :) Thanks a lot for fixing this ottomata
[18:05:12] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:05:27] sure thing will be useful! i guess, can you update the wikitech docs?
[18:06:07] https://wikitech.wikimedia.org/wiki/Analytics/Web_publication
[18:06:23] I'll do ottomata - I'm not sure we have some though - I'll create some
[18:06:29] ^^
[18:06:41] ack ottomata - will do
[18:10:33] ty!
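(Putting the pieces of the fix together: combining the flags quoted from hdfs_rsync_job.pp at 17:28:08 with the --delete agreed at 17:32 and the analytics user from 17:22, the an-web pull would look roughly like the sketch below. The URIs, the local path, and the exact hdfs-rsync argument order are assumptions for illustration, not the contents of the merged patch at https://gerrit.wikimedia.org/r/c/operations/puppet/+/911913/.)

    # Run as the analytics user; force world-readable perms locally
    # (755 dirs, 644 files); mirror deletions from HDFS into the
    # hypothetical local published-rsynced tree.
    sudo -u analytics hdfs-rsync -r --delete --perms --chmod D755,F644 \
        hdfs:///wmf/data/published/ file:///srv/published-rsynced/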
[19:11:32] Data-Engineering-Planning, Data Pipelines (Sprint 12): Deprecate old mobile datasets - https://phabricator.wikimedia.org/T329310 (mforns) a:mforns
[19:11:52] Data-Engineering-Planning, Data Pipelines (Sprint 12): Delete empty tables unique_devices_*_wide_* - https://phabricator.wikimedia.org/T329978 (mforns) a:mforns
[19:23:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[19:43:28] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[22:05:12] (SystemdUnitFailed) firing: (9) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:38:31] (CR) Kimberly Sarabia: "Patch to get a schema fragment for web team owned schemas per your request." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/911412 (https://phabricator.wikimedia.org/T335309) (owner: Kimberly Sarabia)
[22:45:33] (CR) Jdlrobson: "Clare is this something you could help us test and review?" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/911412 (https://phabricator.wikimedia.org/T335309) (owner: Kimberly Sarabia)
[22:46:27] (CR) Jdlrobson: Creates web schema fragment (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/911412 (https://phabricator.wikimedia.org/T335309) (owner: Kimberly Sarabia)
[22:49:40] (CR) Clare Ming: Creates web schema fragment (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/911412 (https://phabricator.wikimedia.org/T335309) (owner: Kimberly Sarabia)
[22:54:55] (CR) Clare Ming: Creates web schema fragment (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/911412 (https://phabricator.wikimedia.org/T335309) (owner: Kimberly Sarabia)
[22:55:29] (CR) Clare Ming: Creates web schema fragment (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/911412 (https://phabricator.wikimedia.org/T335309) (owner: Kimberly Sarabia)