[00:26:41] ACKNOWLEDGEMENT - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T336826 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:38:19] what do I need to do to fix ClassNotFoundException errors for things like org.springframework.security.web.util.matcher.IpAddressMatcher when using hive and the GetNetworkOriginUDF from stat1007? I haven't touched this stuff for ages and all of my scripts seem to have bit rotted.
[00:39:18] * bd808 will wander away, but hope his bouncer delivers magic answers from helpful folks at some point ;)
[02:13:35] (SystemdUnitFailed) firing: (19) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:15:47] Data-Engineering, DBA: dbstore1003 filling up - https://phabricator.wikimedia.org/T336733 (Marostegui) s5 was done: ` root@dbstore1003:/srv# du -sh * 1.9T sqldata.s1 562G sqldata.s5 1.5T sqldata.s7 ` Going for s7 now
[06:13:35] (SystemdUnitFailed) firing: (19) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:57:25] Data-Engineering-Planning, Event-Platform Value Stream: [NEEDS GROOMING] Improve reliability of simple stateless services - https://phabricator.wikimedia.org/T322125 (gmodena) >>! In T322125#8857298, @Ottomata wrote: > @gmodena can we close this task? Yes. This is a duplicate of work we already completed.
[06:57:42] Data-Engineering-Planning, Event-Platform Value Stream: [NEEDS GROOMING] Improve reliability of simple stateless services - https://phabricator.wikimedia.org/T322125 (gmodena) Open→Resolved
[08:33:12] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review, Shared-Data-Infrastructure (Q4 Wrap up): Upgrade the spark YARN shuffler service on Hadoop workers from version 2 to 3 - https://phabricator.wikimedia.org/T332765 (CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data...
[08:33:22] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review, Shared-Data-Infrastructure (Q4 Wrap up): Upgrade the spark YARN shuffler service on Hadoop workers from version 2 to 3 - https://phabricator.wikimedia.org/T332765 (CodeReviewBot)
[08:35:19] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review, Shared-Data-Infrastructure (Q4 Wrap up): Upgrade the spark YARN shuffler service on Hadoop workers from version 2 to 3 - https://phabricator.wikimedia.org/T332765 (CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data...
[08:56:08] Kerberos PSA, I've upgraded krb1001 to Bullseye and I'm adding it back to the list of active KDCs in a bit. krb1001 works fine in my tests, but if you notice any issues with kerberised services, please let me know (or revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/920637)
[08:56:54] ack, thanks moritzm - Will be on the lookout for any issues.
[09:10:08] bd808: Hi - My guess is that you're using a jar that is either not present anymore in our cluster, or wrongly named - In your script you should have an `ADD JAR ...` statement - Could you please paste it here so that I can check it?
[09:11:13] bd808: Also, we're pushing toward removing Hive engine support, in favor of spark - if you have some time next week, I'd like to show how to run your script in there (it should work without change from Hive)
[09:23:12] Data-Engineering, Data-Platform-SRE, Patch-For-Review, Shared-Data-Infrastructure (Q4 Wrap up): Bring stat1009 into service - https://phabricator.wikimedia.org/T336036 (BTullis) Adding #data-platform-sre tag for visibility.
[09:48:35] Data-Engineering, serviceops-radar, Event-Platform Value Stream (Sprint 14 A), Patch-For-Review: Store Flink HA metadata in Zookeeper - https://phabricator.wikimedia.org/T331283 (JMeybohm) As we have one zookeeper cluster per DC, I think it's not required to include the DC name in the Zookeeper p...
[10:13:35] (SystemdUnitFailed) firing: (19) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:18:12] Data-Engineering, serviceops-radar, Event-Platform Value Stream (Sprint 14 A), Patch-For-Review: Store Flink HA metadata in Zookeeper - https://phabricator.wikimedia.org/T331283 (gmodena) >>! In T331283#8858396, @JMeybohm wrote: >> Who's currently responsible for Zookeeper? Is there a request pr...
[10:18:46] Data-Engineering-Planning, Data-Platform-SRE, Shared-Data-Infrastructure (Q4 Wrap up): Upgrade the spark YARN shuffler service on Hadoop workers from version 2 to 3 - https://phabricator.wikimedia.org/T332765 (BTullis) The deb file for conda-analytics version 0.0.14 has now been published. https://gi...
[10:28:12] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review, Shared-Data-Infrastructure (Q4 Wrap up): Upgrade the spark YARN shuffler service on Hadoop workers from version 2 to 3 - https://phabricator.wikimedia.org/T332765 (CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data...
[10:28:23] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review, Shared-Data-Infrastructure (Q4 Wrap up): Upgrade the spark YARN shuffler service on Hadoop workers from version 2 to 3 - https://phabricator.wikimedia.org/T332765 (CodeReviewBot)
[10:29:15] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review, Shared-Data-Infrastructure (Q4 Wrap up): Upgrade the spark YARN shuffler service on Hadoop workers from version 2 to 3 - https://phabricator.wikimedia.org/T332765 (CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data...
[10:36:34] Data-Engineering-Planning, Data-Platform-SRE, Shared-Data-Infrastructure (Q4 Wrap up): Upgrade the spark YARN shuffler service on Hadoop workers from version 2 to 3 - https://phabricator.wikimedia.org/T332765 (BTullis) I'm reverting the version 0.0.14 release, deleting the v0.0.14 tag and re-running...
[10:42:28] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review, Shared-Data-Infrastructure (Q4 Wrap up): Upgrade the spark YARN shuffler service on Hadoop workers from version 2 to 3 - https://phabricator.wikimedia.org/T332765 (CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data...
[10:42:38] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review, Shared-Data-Infrastructure (Q4 Wrap up): Upgrade the spark YARN shuffler service on Hadoop workers from version 2 to 3 - https://phabricator.wikimedia.org/T332765 (CodeReviewBot)
[10:42:49] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review, Shared-Data-Infrastructure (Q4 Wrap up): Upgrade the spark YARN shuffler service on Hadoop workers from version 2 to 3 - https://phabricator.wikimedia.org/T332765 (CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data...
[11:50:12] Quarry, cloud-services-team (FY2022/2023-Q4): Consider moving Quarry to be an installation of a community supported analytics tool - https://phabricator.wikimedia.org/T169452 (Rchard2scout) Is this the right place to complain about features from Quarry that I'm missing from Superset? Because if so, I'm m...
[11:52:20] (CR) Nikerabbit: [C: +1] Make consistent identation in languages.json [analytics/wikistats2] - https://gerrit.wikimedia.org/r/920275 (owner: Amire80)
[12:10:41] joal: The script (written years ago) has `ADD JAR /srv/deployment/analytics/refinery/artifacts/refinery-hive.jar;` I added `ADD JAR /srv/deployment/analytics/refinery/artifacts/refinery-core.jar;` to get past the first set of class not found warnings. It's basically acting like there is no class path for autoloading things anymore.
[12:10:52] Quarry, cloud-services-team (FY2022/2023-Q4): Consider moving Quarry to be an installation of a community supported analytics tool - https://phabricator.wikimedia.org/T169452 (Huji) I got here through the message at the top of Quarry. I really liked Superset's interface. The first issue that came to my a...
[12:11:41] spark3-sql isn't doing much better. `Error in query: No handler for UDF/UDAF/UDTF 'org.wikimedia.analytics.refinery.hive.GetNetworkOriginUDF': java.lang.reflect.InvocationTargetException; line 9 pos 9`
[12:12:09] Quarry, cloud-services-team (FY2022/2023-Q4): Consider moving Quarry to be an installation of a community supported analytics tool - https://phabricator.wikimedia.org/T169452 (Huji) Another observation: in Quarry, you could get a public link to your query and share it with others; in Superset, the "Copy...
[12:14:19] joal: the script in question is at stat1007:/home/bd808/projects/wmcs-api-calls/query-wmcs-usage.sql if you want to take a look. I wanted to find the number of API calls from each classified network origin January-March 2023.
[12:32:33] bd808: joal , we changed to properly naming jars with dependencies suffixed with -shaded
[12:32:40] so the refinery-hive|job.jar does not include the deps you want
[12:32:41] try
[12:32:53] ADD JAR /srv/deployment/analytics/refinery/artifacts/refinery-hive-shaded.jar
[12:35:09] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review, Shared-Data-Infrastructure (Q4 Wrap up): Upgrade the spark YARN shuffler service on Hadoop workers from version 2 to 3 - https://phabricator.wikimedia.org/T332765 (CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data...
[12:35:20] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review, Shared-Data-Infrastructure (Q4 Wrap up): Upgrade the spark YARN shuffler service on Hadoop workers from version 2 to 3 - https://phabricator.wikimedia.org/T332765 (CodeReviewBot)
[12:35:25] ottomata: thanks! That gets me past the java errors and on to actually fixing my sql. :)
[12:37:55] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review, Shared-Data-Infrastructure (Q4 Wrap up): Upgrade the spark YARN shuffler service on Hadoop workers from version 2 to 3 - https://phabricator.wikimedia.org/T332765 (CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data...
[12:38:39] Data-Engineering, serviceops-radar, Event-Platform Value Stream (Sprint 14 A), Patch-For-Review: Store Flink HA metadata in Zookeeper - https://phabricator.wikimedia.org/T331283 (Ottomata) > As we have one zookeeper cluster per DC, I think it's not required to include the DC name in the Zookeeper...
[12:45:25] Data-Engineering, serviceops-radar, Event-Platform Value Stream (Sprint 14 A), Patch-For-Review: Store Flink HA metadata in Zookeeper - https://phabricator.wikimedia.org/T331283 (gmodena) >>! In T331283#8858931, @Ottomata wrote: >> As we have one zookeeper cluster per DC, I think it's not require...
[12:49:30] Data-Engineering, serviceops-radar, Event-Platform Value Stream (Sprint 14 A), Patch-For-Review: Store Flink HA metadata in Zookeeper - https://phabricator.wikimedia.org/T331283 (Ottomata) Hm, I mean let's keep your proposed naming schema of: `/flink//`...
[12:59:17] Data-Engineering, serviceops-radar, Event-Platform Value Stream (Sprint 14 A), Patch-For-Review: Store Flink HA metadata in Zookeeper - https://phabricator.wikimedia.org/T331283 (JMeybohm) >>! In T331283#8858931, @Ottomata wrote: >> As we have one zookeeper cluster per DC, I think it's not requir...
[13:16:53] !log roll-rebooting an-worker1[096-101] for T335835
[13:17:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:18:35] Data-Engineering, serviceops-radar, Event-Platform Value Stream (Sprint 14 A), Patch-For-Review: Store Flink HA metadata in Zookeeper - https://phabricator.wikimedia.org/T331283 (Ottomata) Aye :) > It would be nice if that path could be baked into the flink-app chart somehow to prevent mistakes....
[13:19:35] elukey: o/ I'm planning to `sre.k8s.reboot-nodes dse-k8s` if that's OK with you - for T335835
[13:19:48] +1 !
[13:22:14] Ack, thanks.
[13:22:28] !log roll-rebooting dse-k8s-workers via cookbook
[13:22:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:22:50] btullis: ah TIL about the task, sooo many reboooottts
[13:24:20] elukey: Yup, but still, the automation is getting better all the time.
[13:31:15] Data-Engineering, AQS2.0, API Platform (AQS 2.0 Roadmap), Epic, and 2 others: AQS 2.0: Device Analytics service - https://phabricator.wikimedia.org/T288298 (FJoseph-WMF)
[13:31:35] Data-Engineering, AQS2.0, API Platform (AQS 2.0 Roadmap), Epic, and 2 others: AQS 2.0: Page Analytics Service - https://phabricator.wikimedia.org/T288296 (FJoseph-WMF)
[13:31:52] Data-Engineering, AQS2.0, API Platform (AQS 2.0 Roadmap), Epic, and 2 others: AQS 2.0: Media Analytics Service - https://phabricator.wikimedia.org/T288303 (FJoseph-WMF)
[13:32:23] Data-Engineering, AQS2.0, API Platform (AQS 2.0 Roadmap), Epic, and 2 others: AQS 2.0: Geo Analytics Service - https://phabricator.wikimedia.org/T288305 (FJoseph-WMF)
[13:32:54] Analytics, AQS2.0, API Platform (AQS 2.0 Roadmap), Documentation, and 3 others: AQS 2.0 documentation - https://phabricator.wikimedia.org/T288664 (FJoseph-WMF)
[13:38:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:11:22] (Abandoned) Ottomata: mediawiki/page/change - Use single array field for user attributes [schemas/event/primary] - https://gerrit.wikimedia.org/r/919106 (https://phabricator.wikimedia.org/T336506) (owner: Ottomata)
[14:21:25] FYI, in case anyone specifically needs a feature from the legacy AQS cookbook? https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/920704/
[14:35:56] Data-Engineering, Event-Platform Value Stream (Sprint 14 A), Patch-For-Review: Release mediawiki.page_change.v1 stream - https://phabricator.wikimedia.org/T336817 (Ottomata)
[14:42:41] Starting build #20 for job wikimedia-event-utilities-maven-release-docker
[14:46:16] Project wikimedia-event-utilities-maven-release-docker build #20: SUCCESS in 3 min 35 sec: https://integration.wikimedia.org/ci/job/wikimedia-event-utilities-maven-release-docker/20/
[14:49:27] Data-Engineering, AQS2.0, API Platform (AQS 2.0 Roadmap), Epic, and 2 others: AQS 2.0: Page Analytics Service - https://phabricator.wikimedia.org/T288296 (JArguello-WMF)
[14:49:38] Data-Engineering, AQS2.0, API Platform (AQS 2.0 Roadmap), Epic, and 2 others: AQS 2.0: Device Analytics service - https://phabricator.wikimedia.org/T288298 (JArguello-WMF)
[15:32:44] Data-Engineering, Event-Platform Value Stream (Sprint 14 A), Patch-For-Review: eventutilities-python manager should set up python logging with ECS format - https://phabricator.wikimedia.org/T335802 (CodeReviewBot) otto opened https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichm...
[15:32:54] Data-Engineering, Event-Platform Value Stream (Sprint 14 A), Patch-For-Review: eventutilities-python manager should set up python logging with ECS format - https://phabricator.wikimedia.org/T335802 (CodeReviewBot)
[15:37:39] Data-Engineering, Event-Platform Value Stream (Sprint 14 A), Patch-For-Review: eventutilities-python manager should set up python logging with ECS format - https://phabricator.wikimedia.org/T335802 (CodeReviewBot) otto merged https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichm...
[15:49:42] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review, Shared-Data-Infrastructure (Q4 Wrap up): Upgrade the spark YARN shuffler service on Hadoop workers from version 2 to 3 - https://phabricator.wikimedia.org/T332765 (CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data...
[15:49:49] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review, Shared-Data-Infrastructure (Q4 Wrap up): Upgrade the spark YARN shuffler service on Hadoop workers from version 2 to 3 - https://phabricator.wikimedia.org/T332765 (CodeReviewBot)
[15:50:07] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review, Shared-Data-Infrastructure (Q4 Wrap up): Upgrade the spark YARN shuffler service on Hadoop workers from version 2 to 3 - https://phabricator.wikimedia.org/T332765 (CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data...
[16:44:46] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review, Shared-Data-Infrastructure (Q4 Wrap up): Upgrade the spark YARN shuffler service on Hadoop workers from version 2 to 3 - https://phabricator.wikimedia.org/T332765 (CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data...
[16:44:56] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review, Shared-Data-Infrastructure (Q4 Wrap up): Upgrade the spark YARN shuffler service on Hadoop workers from version 2 to 3 - https://phabricator.wikimedia.org/T332765 (CodeReviewBot)
[16:47:18] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review, Shared-Data-Infrastructure (Q4 Wrap up): Upgrade the spark YARN shuffler service on Hadoop workers from version 2 to 3 - https://phabricator.wikimedia.org/T332765 (CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data...
[17:01:08] Data-Engineering, Anti-Harassment, DBA, Data-Persistence, and 2 others: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 (Tchanders)
[17:03:29] (SystemdUnitFailed) firing: (20) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:06:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:07:44] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:10:27] (PS9) Urbanecm: Add analytics/mediawiki/mentor_dashboard/interaction [schemas/event/secondary] - https://gerrit.wikimedia.org/r/919236 (https://phabricator.wikimedia.org/T325117)
[17:13:29] (SystemdUnitFailed) firing: (20) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:32:50] (CR) Ottomata: [C: +1] ProduceCanaryEvents: set a timeout on the http client (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/894642 (https://phabricator.wikimedia.org/T330236) (owner: DCausse)
[17:33:05] (CR) Ottomata: [C: +2] "Wow this change got lost in a backlog, and the problem bit us again. Merging." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/894642 (https://phabricator.wikimedia.org/T330236) (owner: DCausse)
[17:33:20] Data-Engineering, Anti-Harassment, DBA, Data-Persistence, and 2 others: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 (Marostegui) p:Triage→Medium a:Ladsgroup
[17:34:01] Data-Engineering-Planning, Event-Platform Value Stream, Patch-For-Review: Event partitions missing since 2023-02-21T10:00 for stream without events (canary events not produced?) - https://phabricator.wikimedia.org/T330236 (Ottomata) I think this happened again. I noticed today that there were no ca...
[17:39:47] (CR) Ottomata: [V: +2 C: +2] ProduceCanaryEvents: set a timeout on the http client [analytics/refinery/source] - https://gerrit.wikimedia.org/r/894642 (https://phabricator.wikimedia.org/T330236) (owner: DCausse)
[17:40:49] (PS1) Ottomata: Update changelog for v0.2.15 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/920758
[17:40:57] (CR) Ottomata: [V: +2 C: +2] Update changelog for v0.2.15 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/920758 (owner: Ottomata)
[17:41:16] Starting build #120 for job analytics-refinery-maven-release-docker
[17:55:35] Project analytics-refinery-maven-release-docker build #120: SUCCESS in 14 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/120/
[17:58:47] !log Deployed refinery-source using jenkins
[17:58:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:59:39] Starting build #79 for job analytics-refinery-update-jars-docker
[17:59:57] (PS1) Maven-release-user: Add refinery-source jars for v0.2.15 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/920340
[17:59:58] Project analytics-refinery-update-jars-docker build #79: SUCCESS in 18 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/79/
[18:01:27] (CR) Ottomata: [V: +2 C: +2] Add refinery-source jars for v0.2.15 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/920340 (owner: Maven-release-user)
[18:08:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:18:02] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 A), Patch-For-Review: Improve mediawiki-event-enrichment test suite - https://phabricator.wikimedia.org/T328013 (CodeReviewBot) tchin merged https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/merge_request...
[18:18:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:23:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:33:29] (SystemdUnitFailed) firing: (21) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:34:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:38:29] (SystemdUnitFailed) firing: (21) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:45:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:48:29] (SystemdUnitFailed) firing: (21) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:51:22] Data-Engineering, Anti-Harassment, DBA, Data-Persistence, and 3 others: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 (Ladsgroup) So once [[https://wikitech.wikimedia.org/wiki/Auto_schema|auto_schema]] starts, it's going to be quite noisy s...
[18:53:29] (SystemdUnitFailed) firing: (21) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:03:29] (SystemdUnitFailed) firing: (21) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:08:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:12:34] Data-Engineering, Event-Platform Value Stream: flink-app: swift bucket and zookeeper paths should be templated. - https://phabricator.wikimedia.org/T336901 (gmodena)
[19:18:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:23:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:28:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[19:33:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:38:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:48:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:53:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[19:53:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:01:05] Data-Engineering-Planning, serviceops, Event-Platform Value Stream (Sprint 14 A), Patch-For-Review, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (Ottomata) @JMeybohm, I just tried to apply ^ in staging-co...
[20:03:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:08:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:09:31] Data-Engineering-Planning, Event-Platform Value Stream: Event partitions missing since 2023-02-21T10:00 for stream without events (canary events not produced?) - https://phabricator.wikimedia.org/T330236 (Ottomata) Everything is deployed, but I had to revert the puppet change that would use the new code...
[20:18:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:23:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:33:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:38:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:48:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:53:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:03:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:08:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:18:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:23:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:33:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:38:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:48:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:53:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:03:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:06:40] Quarry, cloud-services-team (FY2022/2023-Q4): Consider moving Quarry to be an installation of a community supported analytics tool - https://phabricator.wikimedia.org/T169452 (nskaggs) I've created https://wikitech.wikimedia.org/w/index.php?title=Superset to provide some more details and potentially hold...
[22:08:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:18:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:23:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:33:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:33:46] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.041 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[22:38:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:48:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:53:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:03:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:08:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:18:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:23:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:33:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:38:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:48:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:53:29] (SystemdUnitFailed) firing: (20) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed