[01:35:27] (SystemdUnitFailed) firing: (9) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:16:43] (DiskSpace) firing: Disk space an-test-worker1002:9100:/ 0.05584% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[02:40:12] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:05:12] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:15:12] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:16:43] (DiskSpace) firing: Disk space an-test-worker1002:9100:/ 0.0001886% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[06:38:58] Data-Engineering, DBA, Data-Services: Prepare and check storage layer for fatwiki - https://phabricator.wikimedia.org/T335018 (Marostegui) Database `_p` created and grants created. This is ready for views creation.
[06:39:28] Data-Engineering, DBA, Data-Services: Prepare and check storage layer for kcgwiktionary - https://phabricator.wikimedia.org/T334739 (Marostegui) Database `_p` created and grants created. This is ready for views creation.
[06:39:53] Data-Engineering, DBA, Data-Services: Prepare and check storage layer for guwwikinews - https://phabricator.wikimedia.org/T334408 (Marostegui) Database `_p` created and grants created. This is ready for views creation.
[06:40:16] Data-Engineering, DBA, Data-Services: Prepare and check storage layer for kbdwiktionary - https://phabricator.wikimedia.org/T333270 (Marostegui)
[08:15:12] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:43:32] (GobblinLastSuccessfulRunTooLongAgo) firing: (2) Last successful gobblin run of job event_default_test was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[08:48:32] (GobblinLastSuccessfulRunTooLongAgo) firing: (3) Last successful gobblin run of job event_default_test was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[08:57:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[09:12:29] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[10:16:43] (DiskSpace) firing: Disk space an-test-worker1002:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[12:07:30] Hi btullis - would you have a minute for me?
[12:15:12] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:41:35] joal: patch looks good, looks like we need a kerberos keytab for an-web1001
[12:41:36] https://puppet-compiler.wmflabs.org/output/910761/40829/an-web1001.eqiad.wmnet/change.an-web1001.eqiad.wmnet.err
[12:41:38] will make one...
[12:41:51] ottomata: indeed - I didn't know how to do it
[12:42:13] https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Kerberos#Create_a_keytab_for_a_service
[12:42:20] thank you
[12:42:28] joal: actually, to make PCC work, we need a dummy keytab in 'labs-private'
[12:42:30] can you make that one
[12:42:32] ?
[12:42:41] currently in meeting ottomata :S
[12:43:23] e.g. https://gerrit.wikimedia.org/r/plugins/gitiles/labs/private/+/refs/heads/master/modules/secret/secrets/kerberos/keytabs/an-test-client1001.eqiad.wmnet/analytics/analytics.keytab
[12:43:25] okay can do
[12:48:33] (GobblinLastSuccessfulRunTooLongAgo) firing: (3) Last successful gobblin run of job event_default_test was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[12:51:43] (DiskSpace) resolved: Disk space an-test-worker1002:9100:/ 0.02054% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[13:03:32] (GobblinLastSuccessfulRunTooLongAgo) resolved: (3) Last successful gobblin run of job event_default_test was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[13:04:14] joal: i don't understand how the current hdfs rsync job on dumps server works with user = 'dumpsgen'
[13:04:40] ottomata: I think there is a keytab for them
[13:04:46] oh really...?
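(An aside on the dummy-keytab step discussed above: the wikitech page linked at 12:42:13 has the real procedure; purely as an illustrative sketch, a placeholder keytab for PCC could be produced with MIT Kerberos's ktutil as below. The principal name, kvno, enctype, and WIKIMEDIA realm are assumptions here, not the documented WMF steps.)

    $ ktutil
    ktutil:  addent -password -p analytics/an-web1001.eqiad.wmnet@WIKIMEDIA -k 1 -e aes256-cts
    Password for analytics/an-web1001.eqiad.wmnet@WIKIMEDIA:    (any throwaway value; this file is only a placeholder)
    ktutil:  wkt analytics.keytab
    ktutil:  quit

(The resulting file would then be committed to labs/private under a path mirroring the production secret, presumably modules/secret/secrets/kerberos/keytabs/an-web1001.eqiad.wmnet/analytics/analytics.keytab, by analogy with the an-test-client1001 example linked at 12:43:23.)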
[13:04:46] nm
[13:04:46] and the data is readable by all
[13:04:48] hm
[13:05:01] well well there sure is
[13:05:12] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:05:13] okay i guess i need a keytab for stats user, not analytics :/
[13:06:36] !log killed the gobblin-eventlogging_legacy_test on an-test-coord1001
[13:06:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:07:52] !log restarted the gobblin-eventlogging_legacy_test on an-test-coord1001
[13:07:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:10:53] Thank you ottomata for the keytab creation
[13:11:35] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 12), Patch-For-Review: Refactor parameterization of eventutilities-python and mediawiki-event-enrichment - https://phabricator.wikimedia.org/T328478 (JArguello-WMF)
[13:13:29] joal still working on it, in meetings now
[13:13:43] np ottomata - thank you :)
[13:46:03] Data-Engineering, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 12): an-test-worker1002 is constantly writing to /tmp - https://phabricator.wikimedia.org/T335358 (BTullis)
[13:46:32] Data-Engineering, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 12): an-test-worker1002 is constantly writing to /tmp - https://phabricator.wikimedia.org/T335358 (BTullis)
[13:47:31] !log rebooting an-test-worker1002 T335358
[13:47:32] Data-Engineering, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 12): an-test-worker1002 is constantly writing to /tmp - https://phabricator.wikimedia.org/T335358 (ops-monitoring-bot) Host rebooted by btullis@cumin1001 with reason: Investigating excessive writing to /tmp
[13:47:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:47:34] T335358: an-test-worker1002 is constantly writing to /tmp - https://phabricator.wikimedia.org/T335358
[13:47:50] Data-Engineering, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 12): an-test-worker1002 is constantly writing to /tmp - https://phabricator.wikimedia.org/T335358 (BTullis) p:Triage→Medium
[13:50:47] Data-Engineering, Data-Persistence, IP Masking: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (Tchanders) Thanks @Ladsgroup . I'd be happy to go with this, but before we do, I'd like to hear from @tstarling and/or @daniel first, since they originally decided agains...
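(A side note on the T335358 "constantly writing to /tmp" investigation above: the task itself shows only a reboot; the probes below are standard Linux tooling one might use to find the writer, not commands taken from the task. fatrace is assumed to be installed.)

    # Largest entries under /tmp
    sudo du -xah /tmp | sort -rh | head -n 20
    # Processes currently holding files open under /tmp
    sudo lsof +D /tmp
    # Live write events, filtered to /tmp
    sudo fatrace -f W | grep /tmp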
[13:59:56] (PS1) Btullis: Add guw.wikinews and kbd.wiktionary to the allowlist [analytics/refinery] - https://gerrit.wikimedia.org/r/911869 (https://phabricator.wikimedia.org/T334459)
[14:06:15] Data-Engineering, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 12): an-test-worker1002 is constantly writing to /tmp - https://phabricator.wikimedia.org/T335358 (JArguello-WMF)
[14:08:05] (CR) Joal: [C: -1] "Separators issue" [analytics/refinery] - https://gerrit.wikimedia.org/r/911869 (https://phabricator.wikimedia.org/T334459) (owner: Btullis)
[14:15:42] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 12): Deploy ceph radosgw processes to data-engineering cluster - https://phabricator.wikimedia.org/T330152 (JArguello-WMF)
[14:17:05] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 12): Event Driven Enrichment Pipelines repositories should be generated from a template - https://phabricator.wikimedia.org/T324980 (Ottomata)
[14:20:31] Data-Engineering-Planning, Data Pipelines, Shared-Data-Infrastructure: [Iceberg] Debianize and install iceberg support for Spark, Presto, and optionally Hive - https://phabricator.wikimedia.org/T311738 (JArguello-WMF)
[14:22:23] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 12): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (JArguello-WMF)
[14:23:00] Data-Engineering-Planning, Infrastructure-Foundations, Shared-Data-Infrastructure: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (JArguello-WMF)
[14:25:57] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 12): Deploy ceph radosgw processes to data-engineering cluster - https://phabricator.wikimedia.org/T330152 (JArguello-WMF) p:Triage→High
[14:31:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[14:36:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[14:44:13] Data-Engineering-Planning, Data Pipelines, Shared-Data-Infrastructure: [Iceberg] Debianize and install iceberg support for Spark, Presto, and optionally Hive - https://phabricator.wikimedia.org/T311738 (JArguello-WMF)
[14:46:11] (CR) Aqu: "Looks good. 2 non blocking comments." [analytics/refinery] - https://gerrit.wikimedia.org/r/910092 (https://phabricator.wikimedia.org/T334096) (owner: Mforns)
[14:47:12] Data-Engineering-Planning, Data Pipelines, Shared-Data-Infrastructure: [Iceberg] Debianize and install iceberg support for Spark, Presto, and optionally Hive - https://phabricator.wikimedia.org/T311738 (JArguello-WMF)
[14:47:21] Data-Engineering, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 12): an-test-worker1002 is constantly writing to /tmp - https://phabricator.wikimedia.org/T335358 (BTullis) Open→Resolved Upon reboot, this behaviour seems to have stopped happening and the host is back to normal. {F36962796,w...
[14:48:15] ottomata: I see you've merged the patch for synchronization of HDFS to an-web
[14:48:22] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:48:26] ottomata: Thank you for that :)
[14:48:48] ottomata: woops - actually we have an error as I write :S
[14:49:04] ottomata: I wondered about file group-ownership
[14:49:35] I'm fuelling up the analytics deployment train at the moment. I see four commits to refinery to be deployed: https://etherpad.wikimedia.org/p/analytics-weekly-train
[14:50:04] Anything else for anyone?
[14:50:12] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:50:57] (CR) Btullis: Add guw.wikinews and kbd.wiktionary to the allowlist (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/911869 (https://phabricator.wikimedia.org/T334459) (owner: Btullis)
[14:51:48] (CR) Joal: [C: -1] Add guw.wikinews and kbd.wiktionary to the allowlist (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/911869 (https://phabricator.wikimedia.org/T334459) (owner: Btullis)
[14:53:30] (PS2) Btullis: Add guw.wikinews and kbd.wiktionary to the allowlist [analytics/refinery] - https://gerrit.wikimedia.org/r/911869 (https://phabricator.wikimedia.org/T334459)
[14:54:59] (CR) Joal: Add guw.wikinews and kbd.wiktionary to the allowlist (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/911869 (https://phabricator.wikimedia.org/T334459) (owner: Btullis)
[14:55:38] Running for errand, will be back at standup time
[14:56:02] (PS3) Btullis: Add guw.wikinews and kbd.wiktionary to the allowlist [analytics/refinery] - https://gerrit.wikimedia.org/r/911869 (https://phabricator.wikimedia.org/T334459)
[15:01:03] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:05:12] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:05:45] (CR) Ottomata: [V: +2 C: +2] add an-web1001 to list of targets [analytics/hdfs-tools/deploy] - https://gerrit.wikimedia.org/r/911788 (owner: Ottomata)
[15:27:35] btullis, I've added this puppet patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/908777/ to the train in Etherpad. Thanks !
[15:45:35] aqu: Many thanks.
[15:46:14] (PS1) Mforns: Migrate queries for webrequest_sampled_128 to /hql (Airflow/Spark3) [analytics/refinery] - https://gerrit.wikimedia.org/r/911890 (https://phabricator.wikimedia.org/T334106)
[15:49:57] (PS2) Mforns: Migrate queries for webrequest_sampled_128 to /hql (Airflow/Spark3) [analytics/refinery] - https://gerrit.wikimedia.org/r/911890 (https://phabricator.wikimedia.org/T334106)
[15:51:13] (PS3) Mforns: Migrate queries for webrequest_sampled_128 to /hql (Airflow/Spark3) [analytics/refinery] - https://gerrit.wikimedia.org/r/911890 (https://phabricator.wikimedia.org/T334106)
[15:55:54] (PS3) Mforns: Migrate unique devices druid loading queries to Airflow/SparkSQL [analytics/refinery] - https://gerrit.wikimedia.org/r/910092 (https://phabricator.wikimedia.org/T334096)
[15:56:40] (CR) Mforns: "Thank you for the review @aqu! Made the suggested changes" [analytics/refinery] - https://gerrit.wikimedia.org/r/910092 (https://phabricator.wikimedia.org/T334096) (owner: Mforns)
[16:15:36] ottomata: I'm sorry I forgot that bit about deploying hdfs_tools onto web
[16:16:07] ottomata: the hdfs_rsync had worked when I checked though
[16:16:29] ottomata: I don't understand why hardsync failed though
[16:54:47] hardsync failed? looking
[17:03:06] joal: it works for me, everything looks fine.
[17:03:06] except
[17:03:11] i did
[17:03:25] echo 'hello' | hdfs dfs -put - /wmf/data/published/datasets/tmp1.txt
[17:03:37] hdfs-rsync failed because the default file perms were not readable
[17:04:21] hm, must be because of the default umask for HDFS I assume
[17:05:03] ottomata: we had an error earlier on about hardsync failure - must have succeeded after (and failed originally for a reason I didn't understand)
[17:05:12] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:05:16] I didn't get around to the deployment train today after all. Are we OK if I do it tomorrow morning UK time?
[17:06:04] ottomata: you have removed your file, right?
[17:08:55] ya i removed it
[17:09:06] mwarf
[17:10:38] i mean it's not so bad, we can just ask folks to chmod o+r their files when they put them there?
[17:10:45] or, make a cron that just does that?
[17:10:49] That's what I am pondering
[17:11:05] Is it worth having a job doing this?
[17:11:21] The downside is that people forgetting to do it breaks hdfs_rsync
[17:11:24] :(
[17:11:26] true.
[17:11:54] Could we make the `stats` user part of analytics-private-data?
[17:12:01] Posix ACLs ?
[17:12:04] Or make that folder use a different group
[17:14:54] I know I've mentioned it before, but this sounds like the ideal use case. `hdfs dfs -setfacl -m default:user: /wmf/data/published/datasets`
[17:19:02] why not btullis!
[17:19:16] ottomata: any reason not to do it with ACLs?
[17:22:11] joal: I think we could hdfs-rsync with the analytics user? hmmm
[17:22:48] ottomata: why not!
[17:22:57] dunno if that would break hardsync...
[17:22:58] any solution works for me :)
[17:23:00] i guess not?
[17:23:19] btullis: acls could work...but we have to manage them
[17:23:28] Data would be group readable, but all-readable on the local host I assume
[17:23:45] i think so...
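(The -setfacl command quoted at 17:14:54 appears truncated. For illustration only, a default ACL granting a read-only user on newly created content could look like the sketch below; the 'stats' user is borrowed from the earlier keytab discussion and the permission bits are examples, not the agreed fix. HDFS ACLs also require dfs.namenode.acls.enabled=true on the NameNode.)

    # Default ACL entry: inherited by files/dirs created under the path from now on
    hdfs dfs -setfacl -m default:user:stats:r-x /wmf/data/published/datasets
    # Access ACL entry: applied to content that already exists
    hdfs dfs -setfacl -R -m user:stats:r-x /wmf/data/published/datasets
    # Verify the resulting ACLs
    hdfs dfs -getfacl /wmf/data/published/datasets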
[17:24:45] Not all-readable sorry
[17:25:24] hm, actually I could make hdfs-rsync force the perms to be all-readable on the local machine IIRC
[17:26:58] actually this bit is already done
[17:27:06] does it currently copy perms from source?
[17:27:19] So if the analytics user pulls from HDFS, we should be fine on the web host
[17:27:57] ottomata: see https://github.com/wikimedia/operations-puppet/blob/production/modules/hdfs_tools/manifests/hdfs_rsync_job.pp#L67
[17:28:08] We do: --perms --chmod D755,F644
[17:28:08] D755: Add visual error handling to dashboard and detail - https://phabricator.wikimedia.org/D755
[17:28:25] So we force perms to be 755 for dirs and 644 for files
[17:28:45] (CR) Joal: [V: +2 C: +2] "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/911869 (https://phabricator.wikimedia.org/T334459) (owner: Btullis)
[17:29:15] oh!
[17:29:16] okay
[17:29:47] so the an-web local data is normally ok, and pulling it as the analytics user should do the trick
[17:31:08] hmm joal i think it is okay to hdfs-rsync with delete
[17:31:16] wait...is it?
[17:31:50] ottomata: I assume it is, it'll just mean data will be deleted from the local published-rsynced folder, but kept in the hardsync one
[17:32:02] So still published
[17:32:05] yes right.
[17:32:16] we do --delete in the regular rsync commands
[17:32:32] Ok, let's do it for hdfs-rsync then
[17:32:42] k
[17:32:46] am making a patch
[17:32:48] will include it
[17:32:52] I assume having the --delete makes it easier to manually delete (only one place instead of 2)
[17:32:59] thanks a lot ottomata <3
[17:33:11] we should rename the systemd timer too, it's confusing right now
[17:33:15] will take some manual cleanup, will do
[17:33:41] Oh true! my bad ottomata - I'm sorry for that
[17:34:36] i missed it too
[17:40:42] https://gerrit.wikimedia.org/r/c/operations/puppet/+/911913/
[17:44:14] ottomata: commented
[17:52:36] oof this cleanup will be more complicated than I thought, making patch to ensure absent the old ones...
[18:04:21] joal it works!
[18:04:32] \o/
[18:04:39] i had to chown the an-web dir to analytics:root to allow the hdfs_rsync command to write there
[18:04:43] but it works now!
[18:05:09] Awesome :) Thanks a lot for fixing this ottomata
[18:05:12] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:05:27] sure thing will be useful! i guess, can you update the wikitech docs?
[18:06:07] https://wikitech.wikimedia.org/wiki/Analytics/Web_publication
[18:06:23] I'll do ottomata - I'm not sure we have some though - I'll create some
[18:06:29] ^^
[18:06:41] ack ottomata - will do
[18:10:33] ty!
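(Putting the pieces of the fix together: combining the flags quoted from hdfs_rsync_job.pp at 17:28:08 with the --delete agreed at 17:32 and the analytics user from 17:22, the an-web pull would look roughly like the sketch below. The URIs, the local path, and the exact hdfs-rsync argument order are assumptions for illustration, not the contents of the merged patch at https://gerrit.wikimedia.org/r/c/operations/puppet/+/911913/.)

    # Run as the analytics user; force world-readable perms locally
    # (755 dirs, 644 files); mirror deletions from HDFS into the
    # hypothetical local published-rsynced tree.
    sudo -u analytics hdfs-rsync -r --delete --perms --chmod D755,F644 \
        hdfs:///wmf/data/published/ file:///srv/published-rsynced/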
[19:11:32] Data-Engineering-Planning, Data Pipelines (Sprint 12): Deprecate old mobile datasets - https://phabricator.wikimedia.org/T329310 (mforns) a:mforns
[19:11:52] Data-Engineering-Planning, Data Pipelines (Sprint 12): Delete empty tables unique_devices_*_wide_* - https://phabricator.wikimedia.org/T329978 (mforns) a:mforns
[19:23:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[19:43:28] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[22:05:12] (SystemdUnitFailed) firing: (9) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:38:31] (CR) Kimberly Sarabia: "Patch to get a schema fragment for web team owned schemas per your request." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/911412 (https://phabricator.wikimedia.org/T335309) (owner: Kimberly Sarabia)
[22:45:33] (CR) Jdlrobson: "Clare is this something you could help us test and review?" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/911412 (https://phabricator.wikimedia.org/T335309) (owner: Kimberly Sarabia)
[22:46:27] (CR) Jdlrobson: Creates web schema fragment (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/911412 (https://phabricator.wikimedia.org/T335309) (owner: Kimberly Sarabia)
[22:49:40] (CR) Clare Ming: Creates web schema fragment (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/911412 (https://phabricator.wikimedia.org/T335309) (owner: Kimberly Sarabia)
[22:54:55] (CR) Clare Ming: Creates web schema fragment (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/911412 (https://phabricator.wikimedia.org/T335309) (owner: Kimberly Sarabia)
[22:55:29] (CR) Clare Ming: Creates web schema fragment (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/911412 (https://phabricator.wikimedia.org/T335309) (owner: Kimberly Sarabia)