[01:19:14] RECOVERY - Check systemd state on analytics1059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:26:45] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:27:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:48:38] (PS1) Milimetric: [WIP] Load both high and low resolution map at the same time [analytics/wikistats2] - https://gerrit.wikimedia.org/r/929816
[04:59:47] Data-Engineering, DBA: Clean up clouddb1021 - https://phabricator.wikimedia.org/T337961 (Marostegui) This was a great breakthrough too: T338284#8928454
[05:26:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:37:50] gehel: Would you mind helping me to troubleshoot a java build issue please, when you have some time?
[08:39:25] Could you checkout this branch: https://gitlab.wikimedia.org/repos/data-engineering/bigtop/-/tree/update_bigtop_1.5_build and run `./build_all_bigtop_distros_wmf.sh` please?
[08:41:39] I'm currently getting stuck at the `hive-service` component, within the `hive-pkg` gradle task, with an error about the `ldap-client-api` jar not being found. Like this:
[08:41:54] https://www.irccloud.com/pastebin/cKBXodaM/
[08:45:03] The frustrating thing is that it worked previously: https://phabricator.wikimedia.org/T337465#8914653
[08:54:55] Analytics-Radar, Data-Engineering-Icebox, Data-Persistence (work done), Platform Engineering Roadmap Decision Making, and 7 others: Remove revision_comment_temp and revision_actor_temp - https://phabricator.wikimedia.org/T215466 (Ladsgroup) Open→Resolved
[09:00:38] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (CodeReviewBot) gmodena merged https://gitlab.wiki...
[09:02:48] (PS2) DCausse: mediawiki/revision/score: add the dt field [schemas/event/primary] - https://gerrit.wikimedia.org/r/929733 (https://phabricator.wikimedia.org/T267648)
[09:06:57] btullis: the dependency to org.apache.directory.client.ldap:ldap-client-api being a SNAPSHOT is suspicious
[09:15:14] dcausse: Yes, I thought so too. I found this: https://issues.apache.org/jira/browse/HIVE-21777 and tried the patch there to exclude it, but it didn't work.
[09:20:20] btullis: you mean https://issues.apache.org/jira/secure/attachment/12969400/HIVE-21777.patch ? if yes I'm not even sure where you can apply the patch, are some sources downloaded during the build process?
[09:23:14] Yes, I do mean that patch. Yes, each of the packages listed here https://gitlab.wikimedia.org/repos/data-engineering/bigtop/-/blob/update_bigtop_1.5_build/build_bigtop_wmf.sh#L18-30 downloads its own source archive by the top-level gradle project. Patches are then applied to each component before compiling. e.g. for hive all of these patches are applied:
[09:23:14] https://gitlab.wikimedia.org/repos/data-engineering/bigtop/-/tree/update_bigtop_1.5_build/bigtop-packages/src/common/hive
[09:24:55] I'm not sure that patch applies cleanly anyway, because the version numbers of the bug are 3.1.1 and 4.0.0 - whereas we are dealing with hive version 2.3.6: https://gitlab.wikimedia.org/repos/data-engineering/bigtop/-/blob/update_bigtop_1.5_build/bigtop.bom#L183
[09:26:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:44:45] I'm trying again with a bumped version of apache directory server from 1.5.6 to 1.5.7 and it looks hopeful so far.
[09:46:12] btullis: the patch seems to apply on apache-hive-2.3.6-src but indeed it's not at the exact line, trying to build just apache-hive but it's extremely slow...
[09:50:29] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Rebuild all hadoop packages for bullseye with different distribution suffix mechanism - https://phabricator.wikimedia.org/T337465 (CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/bigtop/-/merge_requests/...
[09:50:58] dcausse: Thanks so much. I think the bump to apacheds server has fixed it: https://gitlab.wikimedia.org/repos/data-engineering/bigtop/-/merge_requests/4/diffs?commit_id=748c06c2729e1358bd7fe4ce559483c9f652bbdc so I'm unblocked for now.
[09:51:16] \o/
[09:51:31] It is a really slow process.
[09:53:18] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Rebuild all hadoop packages for bullseye with different distribution suffix mechanism - https://phabricator.wikimedia.org/T337465 (CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/bigtop/-/merge_requests/...
[10:33:50] btullis: sorry, didn't see the message earlier. Thanks dcausse for helping!
[10:34:20] btullis: I'm pretty busy today, but should we schedule some time for pairing tomorrow?
[10:36:43] gehel: All good, thanks :-) We have a lot of virtual offsite today/tomorrow too, so it's up to you. Hopefully I'm unblocked on the hadoop build (T337465) for now, but I'm happy to pair if you'd like.
[10:36:43] T337465: Rebuild all hadoop packages for bullseye with different distribution suffix mechanism - https://phabricator.wikimedia.org/T337465
[11:11:29] Data-Engineering, DBA: Clean up clouddb1021 - https://phabricator.wikimedia.org/T337961 (Ladsgroup) Yup, I think I'm going to compress templatelinks in commons too.
[11:58:05] (CR) Joal: "Adding reviewers from the product-analytics team and the security team :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/929723 (https://phabricator.wikimedia.org/T338033) (owner: Aqu)
[12:03:56] (PS3) DCausse: mediawiki/revision/score: add the dt field [schemas/event/primary] - https://gerrit.wikimedia.org/r/929733 (https://phabricator.wikimedia.org/T267648)
[13:03:40] (CR) Ottomata: mediawiki/revision/score: add the dt field (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/929733 (https://phabricator.wikimedia.org/T267648) (owner: DCausse)
[13:15:30] !log running the puppet on an-master100[1-2] Remove analytics58_60 from the HDFS topology T317861
[13:15:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:15:32] T317861: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861
[13:26:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:23:09] Starting build #24 for job wikimedia-event-utilities-maven-release-docker
[14:27:37] Project wikimedia-event-utilities-maven-release-docker build #24: SUCCESS in 4 min 27 sec: https://integration.wikimedia.org/ci/job/wikimedia-event-utilities-maven-release-docker/24/
[14:36:44] (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:46:45] (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:46:59] (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:01:44] (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:06:44] (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:10:12] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): jsonschema-tools deterministic schema test should fail if a schema uses oneOf with different types - https://phabricator.wikimedia.org/T337855 (Ottomata)
[15:10:38] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: jsonschema-tools deterministic schema test should fail if a object field does not have schema - https://phabricator.wikimedia.org/T338228 (Ottomata) @tchin okay! if you get your npm perms right (see slack thread), try to...
[15:15:06] (CR) Ottomata: [C: +2] Encode redirect targets in page change events. [schemas/event/primary] - https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: Peter Fischer)
[15:15:36] (CR) CI reject: [V: -1] Encode redirect targets in page change events. [schemas/event/primary] - https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: Peter Fischer)
[15:21:12] (PS23) Ottomata: Encode redirect targets in page change events. [schemas/event/primary] - https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: Peter Fischer)
[15:24:47] Quarry, superset.wmcloud.org, cloud-services-team (FY2022/2023-Q4): Move Quarry to be an installation of Superset - https://phabricator.wikimedia.org/T169452 (AntiCompositeNumber) Superset is interesting, but I agree with Bawolff that simple is a good goal. Superset requires that I find the correct d...
[15:26:15] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (Ottomata) > When the page_content_change applicat...
[15:28:56] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (Ottomata) Actually, [[ https://codesearch.wmcloud...
[15:31:44] (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:36:44] (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:44:14] (CR) Ottomata: [C: +2] Encode redirect targets in page change events. [schemas/event/primary] - https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: Peter Fischer)
[15:48:01] (Merged) jenkins-bot: Encode redirect targets in page change events. [schemas/event/primary] - https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: Peter Fischer)
[16:01:29] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: eventutilities-python: review and clean up in preparation for a GA release. - https://phabricator.wikimedia.org/T336488 (CodeReviewBot) gmodena opened https://gitlab.wikimedia.org/repos/data-engineering/eventutilities-pyth...
[16:01:44] (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:06:45] (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:31:45] (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:35:20] thanks Krinkle !
[16:36:45] (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:40:33] Analytics, AQS2.0, Tech-Docs-Team, API Platform (AQS 2.0 Roadmap), and 5 others: AQS 2.0 documentation - https://phabricator.wikimedia.org/T288664 (apaskulin)
[17:01:45] (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:06:44] (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:21:45] (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:34:43] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Rebuild all hadoop packages for bullseye with different distribution suffix mechanism - https://phabricator.wikimedia.org/T337465 (BTullis) I've now got a successful rebuild of all of bigtop for buster and bullseye, using this new script...
[17:57:14] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Rebuild all hadoop packages for bullseye with different distribution suffix mechanism - https://phabricator.wikimedia.org/T337465 (MoritzMuehlenhoff) Nice work!
[18:11:52] mforns: Heya - would you have a minute for me?
[18:11:59] mforns: python/airflow testing issues
[18:12:01] :(
[18:12:08] joal: sure!
[18:12:16] batcave?
[18:29:37] Analytics, AQS2.0, Tech-Docs-Team, API Platform (AQS 2.0 Roadmap), and 5 others: AQS 2.0 documentation - https://phabricator.wikimedia.org/T288664 (FJoseph-WMF)
[19:00:03] (PS4) DCausse: mediawiki/revision/score: add the dt field [schemas/event/primary] - https://gerrit.wikimedia.org/r/929733 (https://phabricator.wikimedia.org/T267648)
[19:02:11] (CR) DCausse: mediawiki/revision/score: add the dt field (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/929733 (https://phabricator.wikimedia.org/T267648) (owner: DCausse)
[19:12:10] Quarry, superset.wmcloud.org, cloud-services-team (FY2022/2023-Q4): Move Quarry to be an installation of Superset - https://phabricator.wikimedia.org/T169452 (rook)
[19:14:27] (PS6) Gmodena: page_change: add a flag for bad revision data. [schemas/event/primary] - https://gerrit.wikimedia.org/r/929786 (https://phabricator.wikimedia.org/T309699)
[19:24:53] Quarry, superset.wmcloud.org, cloud-services-team (FY2022/2023-Q4): Move Quarry to be an installation of Superset - https://phabricator.wikimedia.org/T169452 (rook) For anyone who does not want quarry to be replaced, I've edited the ticket to be more explicit. The thing that quarry lacks is support....
[19:26:33] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: eventutilities-python: review and clean up in preparation for a GA release. - https://phabricator.wikimedia.org/T336488 (gmodena)
[19:26:54] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): mw-page-content-change-enrich should enable HA with k8s ConfigMaps - https://phabricator.wikimedia.org/T338233 (gmodena) a:gmodena
[19:28:07] (CR) Ottomata: "Events are validated according to their versioned $schema URI, so old events in mediawiki.revision-score (produced by change-prop etc.) wi" [schemas/event/primary] - https://gerrit.wikimedia.org/r/929733 (https://phabricator.wikimedia.org/T267648) (owner: DCausse)
[19:28:53] (CR) Ottomata: [C: +1] "Let me know if you are ready for merge." [schemas/event/primary] - https://gerrit.wikimedia.org/r/929733 (https://phabricator.wikimedia.org/T267648) (owner: DCausse)
[19:31:14] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): mw-page-content-change-enrich should enable HA with k8s ConfigMaps - https://phabricator.wikimedia.org/T338233 (gmodena) > main question I'd have is when do we consider the flink-app chart ready enough so that we can consider switching the WDQS...
[19:36:49] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): mw-page-content-change-enrich should enable HA with k8s ConfigMaps - https://phabricator.wikimedia.org/T338233 (Ottomata) > main question I'd have is when do we consider the flink-app chart ready enough so that we can consider switching the WDQ...
[19:47:57] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: jsonschema-tools deterministic schema test should fail if a object field does not have schema - https://phabricator.wikimedia.org/T338228 (tchin)
[20:04:55] (CR) DCausse: "thanks!" [schemas/event/primary] - https://gerrit.wikimedia.org/r/929733 (https://phabricator.wikimedia.org/T267648) (owner: DCausse)
[20:18:12] !log reran mediawiki_history_reduced druid load task after deploying Joseph's fix
[20:18:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:21:38] (CR) TChin: "recheck" [schemas/event/primary] - https://gerrit.wikimedia.org/r/929782 (https://phabricator.wikimedia.org/T338228) (owner: TChin)
[20:59:49] Quarry, superset.wmcloud.org, cloud-services-team (FY2022/2023-Q4): Move Quarry to be an installation of Superset - https://phabricator.wikimedia.org/T169452 (Bawolff) >>! In T169452#8932838, @rook wrote: > For anyone who does not want quarry to be replaced, I've edited the ticket to be more explicit...
[21:05:06] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): Update eventgate helm chart to use automatic kafka egress networkpolicies and envoy service mesh - https://phabricator.wikimedia.org/T335024 (Ottomata)
[21:17:25] Quarry, superset.wmcloud.org, cloud-services-team (FY2022/2023-Q4): Move Quarry to be an installation of Superset - https://phabricator.wikimedia.org/T169452 (rook) >>! In T169452#8933082, @Bawolff wrote: > Quarry is linking to this task in a banner asking people to give feedback on what they think a...
[21:21:59] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:36:45] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:41:45] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:51:02] Quarry, superset.wmcloud.org, cloud-services-team (FY2022/2023-Q4): Move Quarry to be an installation of Superset - https://phabricator.wikimedia.org/T169452 (Base) > Quarry doesn't get a lot of support Then provide it with a lot of support, duh. It is unhelpful to once again burden unpaid volunteer...