[00:08:13] (DiskSpace) firing: Disk space stat1005:9100:/ 5.12% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[02:42:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[03:02:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[03:08:31] (SystemdUnitFailed) firing: (19) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:08:13] (DiskSpace) firing: Disk space stat1005:9100:/ 5.073% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[05:08:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[05:43:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[07:08:31] (SystemdUnitFailed) firing: (19) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:08:13] (DiskSpace) firing: Disk space stat1005:9100:/ 5.095% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[09:11:35] (CR) Aqu: "webrequest is not bucketed anymore, so I think" [analytics/refinery] - https://gerrit.wikimedia.org/r/911890 (https://phabricator.wikimedia.org/T334106) (owner: Mforns)
[09:24:37] @aqu what do you mean about webrequest? That's the old code with the hostname, sequence bucket, the new code is just 1 out of 128
[09:25:02] you're right the old code wouldn't work, but the new code still samples via spark
[09:25:18] (Marcel tested it, I don't know for sure)
[09:32:47] I will try again with the exact same code then.
[09:48:19] (CR) Aqu: "I've checked, and it works. Sorry." [analytics/refinery] - https://gerrit.wikimedia.org/r/911890 (https://phabricator.wikimedia.org/T334106) (owner: Mforns)
[10:37:02] @aqu: I think you're right that it doesn't sample performantly like the old bucketed table, but we talked about that and Joseph thought it was ok, that once we migrate to Iceberg we'd get back to real sampling so this is temporary
[10:47:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[10:53:35] PROBLEM - Kerberos KDC daemon on krb2002 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/sbin/krb5kdc https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles
[10:55:00] Quarry, cloud-services-team (FY2022/2023-Q4): Consider moving Quarry to be an installation of a community supported analytics tool - https://phabricator.wikimedia.org/T169452 (Zache) >>! In T169452#3415749, @Halfak wrote: > My main concern with this kind of move would be preserving the basic functionalit...
[10:56:35] RECOVERY - Kerberos KDC daemon on krb2002 is OK: PROCS OK: 9 processes with args /usr/sbin/krb5kdc https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles
[11:08:01] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:08:28] (SystemdUnitFailed) firing: (20) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:12:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[11:15:39] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:18:28] (SystemdUnitFailed) firing: (20) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:24:10] FYI, I plan to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/915569/ in a bit, it adds an additional Kerberos KDC server (krb2002, which will eventually replace krb2001)
[11:24:42] it worked fine in my tests, but if there's anything odd for kerberized analytics services, let me know and we can revert
[11:57:50] this has been merged now
[12:01:21] Data-Engineering, Data-Persistence, IP Masking: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (Ladsgroup) >>! In T333223#8825107, @daniel wrote: > This seems more future proof, and no extra work. Am I missing something? We went back and forth on this a bit, there...
[12:08:13] (DiskSpace) firing: Disk space stat1005:9100:/ 5.046% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[13:01:18] moritzm: Many thanks. Apologies for the delay in responding.
[13:02:31] so far everything seems to work just fine anyway :-)
[13:12:21] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 12): Event Driven Enrichment Pipelines repositories should be generated from a template - https://phabricator.wikimedia.org/T324980 (JArguello-WMF)
[13:13:44] Data-Engineering, Event-Platform Value Stream, Patch-For-Review: eventutilities-python manager should set up python logging with ECS format - https://phabricator.wikimedia.org/T335802 (tchin) a:tchin
[13:14:20] Data-Engineering, Event-Platform Value Stream (Sprint 12), Patch-For-Review: mediawiki-event-enrichment: issue async requests from ProcessFunction - https://phabricator.wikimedia.org/T332948 (Ottomata) Still running! And also still backfilling! [[ https://grafana.wikimedia.org/d/K9x0c4aVk/flink-o...
[13:18:10] Data-Engineering, Event-Platform Value Stream (Sprint 12), Patch-For-Review: eventutilities-python manager should set up python logging with ECS format - https://phabricator.wikimedia.org/T335802 (tchin)
[13:31:47] Data-Engineering, SRE, SRE Observability, Event-Platform Value Stream (Sprint 12): Grant IdempotentWrite Kafka Cluster ACL to User:ANONYOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (herron) >>! In T334733#8823968, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (...
[13:33:46] Data-Engineering, Privacy Engineering: The soon-to-be-released pageview datasets should be linked from dumps page - https://phabricator.wikimedia.org/T335958 (Nuria)
[13:36:51] Data-Engineering, Data-Persistence, IP Masking: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (daniel) >>! In T333223#8826260, @Ladsgroup wrote: >>>! In T333223#8825107, @daniel wrote: >> This seems more future proof, and no extra work. Am I missing something? > >...
[13:37:43] Data-Engineering, SRE, SRE Observability, Event-Platform Value Stream (Sprint 12): Grant IdempotentWrite Kafka Cluster ACL to User:ANONYOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (elukey) ` elukey@kafka-logging1001:~$ kafka acls --list kafka-acls --authorizer-properties...
[13:57:48] (PS1) Nick Ifeajika: create knowledge-gap endpoints [analytics/aqs] - https://gerrit.wikimedia.org/r/915678
[14:00:57] (CR) CI reject: [V: -1] create knowledge-gap endpoints [analytics/aqs] - https://gerrit.wikimedia.org/r/915678 (owner: Nick Ifeajika)
[14:07:58] !log failing back hive service to an-coord1001
[14:07:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:12:22] (PS2) Nick Ifeajika: create knowledge-gap endpoints [analytics/aqs] - https://gerrit.wikimedia.org/r/915678
[14:15:57] (CR) CI reject: [V: -1] create knowledge-gap endpoints [analytics/aqs] - https://gerrit.wikimedia.org/r/915678 (owner: Nick Ifeajika)
[14:28:06] Data-Engineering-Planning, SRE-swift-storage, Event-Platform Value Stream (Sprint 12): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (Ottomata) @Eevans How can we help move this along?
[14:34:58] Data-Engineering, SRE, SRE Observability, Event-Platform Value Stream (Sprint 12): Grant IdempotentWrite Kafka Cluster ACL to User:ANONYOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (Ottomata) Yikes, thank you, yes let's delete ACLs for kafka logging. I'm guessing that by...
[14:35:38] Data-Engineering, Data-Persistence, IP Masking: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (Tchanders) It seems that both `user.user_is_temp` vs `actor.actor_type` (bool) have the same fundamental problems: * They're both oddly specific to temp users * They'll b...
[15:04:42] Data-Engineering, Superset: SQL lab access for Andrew McAllister - https://phabricator.wikimedia.org/T335940 (AndrewTavis_WMDE)
[15:05:10] Data-Engineering, Superset: SQL lab access for Andrew McAllister - https://phabricator.wikimedia.org/T335940 (AndrewTavis_WMDE) Added #data-engineering as it looks like #superset isn't active :)
[15:18:31] (SystemdUnitFailed) firing: (19) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:08:13] (DiskSpace) firing: Disk space stat1005:9100:/ 3.769% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[16:10:53] Data-Engineering, Event-Platform Value Stream, Discovery-Search (Current work), Patch-For-Review: Add support for redirects - https://phabricator.wikimedia.org/T325315 (pfischer) Use Cases (with redirect property in page change events, see [[ https://gerrit.wikimedia.org/r/c/schemas/event/primary...
[16:43:52] Data-Engineering: Upgrade eventutiltilies-flink Java lib to Flink 1.17 - https://phabricator.wikimedia.org/T335982 (Ottomata)
[17:06:34] Quarry: Allow downloading output via CLI - https://phabricator.wikimedia.org/T325683 (rook) If I understand correctly, something like: ` export QUERY_ID=73537 export QRUN_ID=$(curl https://quarry.wmcloud.org/query/${QUERY_ID}/meta | jq .latest_run.id) wget -c https://quarry.wmcloud.org/run/${QRUN_ID}/output...
[17:17:06] (PS1) Gerrit maintenance bot: Add gpe.wikipedia to pageview allowlist [analytics/refinery] - https://gerrit.wikimedia.org/r/915786 (https://phabricator.wikimedia.org/T335987)
[17:46:23] Quarry: UI: Allow downloading output via CLI - https://phabricator.wikimedia.org/T325683 (Dusan_Krehel)
[17:47:35] Quarry: UI: Allow downloading output via CLI - https://phabricator.wikimedia.org/T325683 (Dusan_Krehel)
[17:48:54] Quarry: UI: Allow downloading output via CLI - https://phabricator.wikimedia.org/T325683 (rook) @Dusan_Krehel I could use some additional clarification. Is this a request for the documentation to be updated to reflect how one might download on the command line?
[18:02:45] Data-Engineering, Privacy Engineering: The soon-to-be-released pageview datasets should be linked from dumps page - https://phabricator.wikimedia.org/T335958 (Htriedman) +1, I don't know exactly who maintains the analytics.wikimedia.org domain. There are also two other data releases with more historical...
[19:18:31] (SystemdUnitFailed) firing: (19) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:43:57] (PS1) Milimetric: Run actor_signature two different ways and compare to existing implementation for the same hour of webrequest [analytics/refinery] - https://gerrit.wikimedia.org/r/915831
[19:54:44] (Abandoned) Milimetric: Run actor_signature two different ways and compare to existing implementation for the same hour of webrequest [analytics/refinery] - https://gerrit.wikimedia.org/r/915831 (owner: Milimetric)
[19:55:21] (CR) Milimetric: "This is just meant for testing and assessing impact of UA changes on the actor signature. A broader test would be to run more of the pipe" [analytics/refinery] - https://gerrit.wikimedia.org/r/915831 (owner: Milimetric)
[20:02:38] Data-Engineering, Event-Platform Value Stream (Sprint 12), Patch-For-Review: eventutilities-python manager should set up python logging with ECS format - https://phabricator.wikimedia.org/T335802 (Ottomata) Hm, @tchin! I'm looking more closely at the python production logs, and um, they look like the...
[20:08:13] (DiskSpace) firing: Disk space stat1005:9100:/ 3.697% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[20:09:44] !log restarting hive-server2 and hive-metastore on an-coord1002
[20:09:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:12:43] !log executed `sudo apt clean` on stat1005 to free up some space.
[20:12:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:18:13] (DiskSpace) resolved: Disk space stat1005:9100:/ 3.696% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[20:22:27] Data-Engineering, Event-Platform Value Stream (Sprint 12), Patch-For-Review: eventutilities-python manager should set up python logging with ECS format - https://phabricator.wikimedia.org/T335802 (Ottomata) Okay, I deployed with [[ https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-e...
[20:24:02] (Restored) Milimetric: Run actor_signature two different ways and compare to existing implementation for the same hour of webrequest [analytics/refinery] - https://gerrit.wikimedia.org/r/915831 (owner: Milimetric)
[20:24:04] (PS2) Milimetric: Run actor_signature two different ways and compare to existing implementation for the same hour of webrequest [analytics/refinery] - https://gerrit.wikimedia.org/r/915831
[20:24:29] (Abandoned) Milimetric: Run actor_signature two different ways and compare to existing implementation for the same hour of webrequest [analytics/refinery] - https://gerrit.wikimedia.org/r/915831 (owner: Milimetric)
[20:31:34] Data-Engineering, Event-Platform Value Stream (Sprint 12), Patch-For-Review: mediawiki-event-enrichment: issue async requests from ProcessFunction - https://phabricator.wikimedia.org/T332948 (Ottomata) Redeployed with exception being raised. Instead of raising a RequestException (from requests lib)...
[20:49:54] Data-Engineering-Planning, Data Pipelines, Shared-Data-Infrastructure: Add support for Iceberg to the Spark Docker Image - https://phabricator.wikimedia.org/T336012 (xcollazo)
[23:18:31] (SystemdUnitFailed) firing: (19) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed