[00:38:33] (SystemdUnitFailed) firing: (18) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:33:28] (SystemdUnitFailed) firing: (19) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:33:35] (SystemdUnitFailed) firing: (18) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:02:48] btullis: oh, then the scap env is out of date for thin refinery deploys
[08:11:39] good morning milimetric :D
[08:22:37] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 12): eventutilities-python should support using Kafka TLS ports - https://phabricator.wikimedia.org/T331526 (10gmodena) a:03gmodena
[08:49:17] 10Data-Engineering, 10API Platform (AQS 2.0 Roadmap), 10Epic, 10Platform Engineering Roadmap, 10User-Eevans: AQS 2.0: Page Analytics Service - https://phabricator.wikimedia.org/T288296 (10Sfaci)
[09:35:33] (03CR) 10Kosta Harlan: [C: 03+2] "LGTM, though I wonder if this should be in some shared schema that this one inherits from."
[schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/917951 (https://phabricator.wikimedia.org/T325117) (owner: 10Urbanecm)
[09:36:09] (03Merged) 10jenkins-bot: [Growth] Personalized praise: Add database [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/917951 (https://phabricator.wikimedia.org/T325117) (owner: 10Urbanecm)
[10:38:28] (SystemdUnitFailed) firing: (18) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:00:24] ooh, I was still sleeping, afternoon Luca :)
[11:00:39] (over the years I developed the amazing ability to sleep-monitor IRC)
[11:02:47] milimetric: I was looking to update the thin refinery scap targets, but I'm not sure I understand which targets we want in there.
[11:03:14] I wasn't sure why refinery needed to be on the airflow hosts
[11:03:40] Exactly: https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/scap/+/refs/heads/master/environments/thin/targets
[11:03:47] "thin" is anything that just needs a git clone for systemd timers that use files from the local file system
[11:03:49] Why only an-airflow1001?
[11:04:10] and "not thin" / "normal" needs the jars synced
[11:04:31] I think the whole thing is in need of an overhaul
[11:04:56] an-airflow1005 is the new search instance, but it looks like we need to update this as well. https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Airflow/Instances
[11:05:01] if an-airflow1001 is decommissioned in favor of some other host, clearly nothing is failing on the new host, so just remove an-airflow1001?
[11:05:31] I will check with the search platform team too.
[11:06:00] yeah, it'd be nice to get that instances section to be auto-generated...
maybe we can add something custom to Datahub
[11:18:47] (03PS1) 10Btullis: Remove an-airflow1001 from refinery thin deployment targets [analytics/refinery/scap] - 10https://gerrit.wikimedia.org/r/919036 (https://phabricator.wikimedia.org/T333697)
[11:20:08] milimetric: I've just removed it for now https://gerrit.wikimedia.org/r/c/analytics/refinery/scap/+/919036 and I'm asking in #wikimedia-search
[11:27:58] 10Data-Engineering, 10Data-Platform-SRE, 10LDAP-Access-Requests, 10SRE, and 2 others: Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (10BTullis) I have discussed this with @jbond and @MoritzMuehlenhoff and I can appreciate now that it wo...
[11:29:02] Cool, thx btullis
[11:29:21] 10Data-Engineering, 10Data-Platform-SRE, 10LDAP-Access-Requests, 10SRE, and 2 others: Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (10BTullis)
[11:50:23] 10Data-Engineering-Planning, 10Equity-Landscape: Load language data - https://phabricator.wikimedia.org/T315886 (10KCVelaga_WMF)
[12:04:11] aqu: o/ Have you a moment to talk about the spark3 assembly sometime?
[12:05:17] Hi btullis, sure
[12:05:47] 10Data-Engineering, 10Event-Platform Value Stream: [NEEDS GROOMING] eventutilities-python: review and clean up - https://phabricator.wikimedia.org/T336488 (10gmodena)
[12:05:56] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 13): [NEEDS GROOMING] eventutilities-python: review and clean up - https://phabricator.wikimedia.org/T336488 (10gmodena)
[12:06:13] Cool. I happened upon this patch of yours from last July: https://gerrit.wikimedia.org/r/c/operations/puppet/+/810951
[12:06:30] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 13): eventutilities-python: review and clean up in preparation for a GA release.
- https://phabricator.wikimedia.org/T336488 (10gmodena)
[12:07:09] Do you think that this approach is still a good option for us?
[12:08:25] I totally remember now about the `spark-${spark3_version}-assembly.zip` file not existing because we're using a pyspark install. It had slipped my mind completely.
[12:16:05] I think it makes more sense to generate the assembly from the GitLab CI.
[12:18:34] Now, as for how to send it to HDFS, I feel OK with a manual step, as we are already sending the .deb manually to apt.wm.org
[12:20:07] OK, understood. I would advocate for trying to remove both manual steps, rather than adding a second one because we already have one, but never mind.
[12:20:58] Did I read in a comment that you don't believe we should put the assembly into the conda environment because it's already very large? I can't find the comment right now.
[12:21:27] Yes I did.
[12:22:11] Here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/901670
[12:28:55] Ah yes, thanks. OK, so would it help if we created a GitLab CI job to generate the assembly and push it to the GitLab Package Registry, alongside the deb file?
[12:30:26] And when you say "...allows us to test the jar before it reaches production" - what sort of tests do you have in mind? Automated tests, or something else?
[12:33:14] If we go for full automation, then the script `modules/profile/files/hadoop/spark3/spark3_upload_assembly.sh` would need to be more complex: retrieve the assembly file with the Spark version, make sure the file matches the one from the configuration, and then checksum it against the file on HDFS (maybe for that we could use a companion file containing the md5 only).
[12:33:15] I feel the automation of the creation of the file should be part of the CI.
[12:33:15] Do you know when the "kerberos::exec spark3_upload_assembly" is going to run? In the context of deploying a new version, we would like it to run immediately after apt upgrade.
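The md5 companion-file idea from the 12:33 message could be sketched as follows. This is a minimal sketch, not the actual `spark3_upload_assembly.sh` logic: the function names and the assumption of an md5sum-style companion file (`<md5hex>  <filename>`) sitting next to the copy on HDFS are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so a large assembly zip never loads fully into memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def assembly_needs_upload(local_assembly: str, companion_md5: str) -> bool:
    """Compare the local assembly's md5 against the companion .md5 file.

    Returns True when the companion file is missing or the hashes differ,
    i.e. the copy on HDFS would need to be replaced.
    """
    companion = Path(companion_md5)
    if not companion.exists():
        return True
    # Companion file format assumed: "<md5hex>  <filename>" (md5sum output style)
    recorded = companion.read_text().split()[0]
    return md5_of(local_assembly) != recorded
```

The actual script would fetch the companion file from HDFS first (e.g. via `hdfs dfs -get`); only the local comparison is shown here.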
[12:35:09] Letting a GitLab CI job create the assembly file feels right, yes.
[12:35:59] It would run on the analytics_cluster::coordinator role (and the equivalent for the test cluster)
[12:37:48] "Do you know when the "kerberos::exec spark3_upload_assembly" is going to run?" > I think it would be triggered immediately after the Package['spark3'] installation, but we could check that.
[12:41:34] OK, thanks for all this. I'm in support of whatever you think feels right in terms of building this assembly. However, I'm keen on automating it too. We can do this one-off for Iceberg just by running the existing script from a workstation though; is that correct?
[12:42:15] About testing the assembly file, I would do it the same way as someone could test the deb package before sending it to production:
[12:42:16] * Download the .deb in a statbox
[12:42:16] * Put the assembly file on HDFS
[12:42:16] * Setup a directory with the Spark configuration (update path to the assembly file)
[12:42:16] * Run a Spark job.
[12:43:01] The manual one-off for Iceberg seems OK.
[12:52:02] Right, lots to think about. Many thanks.
[12:53:18] Maybe we could build the assembly and then push it to Archiva?
[13:11:49] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 12), 10Patch-For-Review: Improve mediawiki-event-enrichment test suite - https://phabricator.wikimedia.org/T328013 (10CodeReviewBot) tchin opened https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/merge_requests/...
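The "build the assembly in GitLab CI and push it alongside the deb" idea from the 12:16-12:35 exchange could look roughly like the job below. This is only a sketch: the job name, the zip/md5sum steps, and the package name are assumptions, not an existing pipeline; the upload endpoint is GitLab's real generic package registry API.

```yaml
# Hypothetical CI job: package the jars from the pyspark install into an
# assembly zip and upload it to the project's generic package registry.
build_spark_assembly:
  stage: build
  script:
    # SPARK_HOME and SPARK_VERSION are assumed to be defined elsewhere in the pipeline
    - cd "${SPARK_HOME}/jars"
    - zip -q "../spark-${SPARK_VERSION}-assembly.zip" ./*.jar
    - md5sum "../spark-${SPARK_VERSION}-assembly.zip" > "../spark-${SPARK_VERSION}-assembly.zip.md5"
    # Upload to the generic package registry, next to the deb artifacts
    - |
      curl --fail --header "JOB-TOKEN: ${CI_JOB_TOKEN}" \
        --upload-file "${SPARK_HOME}/spark-${SPARK_VERSION}-assembly.zip" \
        "${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/packages/generic/spark-assembly/${SPARK_VERSION}/spark-${SPARK_VERSION}-assembly.zip"
```

Publishing the md5 companion file the same way would let the upload script on the coordinator verify the copy on HDFS without re-downloading the full zip.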
[13:11:59] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 12), 10Patch-For-Review: Improve mediawiki-event-enrichment test suite - https://phabricator.wikimedia.org/T328013 (10CodeReviewBot)
[13:40:41] 10Data-Engineering, 10Data-Persistence, 10Event-Platform Value Stream, 10IP Masking, 10Platform Engineering: MediaWiki user types - https://phabricator.wikimedia.org/T336176 (10Ottomata) Okay, Event Platform [[ https://phabricator.wikimedia.org/T333468#8843783 | needs to release ]] the mediawiki.page_cha...
[13:41:57] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 08), 10MW-1.40-notes (1.40.0-wmf.23; 2023-02-13), 10Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (10Ottomata) Okay, based on [[ https://phabricator.wikimedi...
[13:42:09] PROBLEM - Checks that the airflow database for airflow analytics_product is working properly on an-airflow1006 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow db check did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[13:44:05] PROBLEM - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[13:45:05] 10Data-Engineering-Planning, 10Data Pipelines (Sprint 12), 10Patch-For-Review: Setup config to allow lineage instrumentation - https://phabricator.wikimedia.org/T333004 (10Antoine_Quhen) https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/382
[13:46:50] 10Data-Engineering, 10Data-Persistence, 10Event-Platform Value Stream, 10IP Masking, 10Platform Engineering: MediaWiki user types -
https://phabricator.wikimedia.org/T336176 (10Ottomata) Anybody got a quick and easy better name for this concept than 'user type'? Somehow 'type' doesn't seem quite right.
[14:06:47] editation
[14:07:07] oops, wrong place
[14:37:37] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 08), 10MW-1.40-notes (1.40.0-wmf.23; 2023-02-13): mediawiki/page/change event schema - Use single array field for user attributes instead of boolean fields - https://phabricator.wikimedia.org/T336506 (10Ottomata)
[14:38:24] 10Data-Engineering, 10Data-Persistence, 10Event-Platform Value Stream, 10IP Masking, 10Platform Engineering: MediaWiki user types - https://phabricator.wikimedia.org/T336176 (10Ottomata) Task for event schema change: {T336506}
[14:38:33] (SystemdUnitFailed) firing: (18) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:55:16] !log replaced /user/spark/share/lib/spark-3.1.2-assembly.jar in HDFS with new version that includes Iceberg.
[14:55:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:30:52] 10Data-Engineering-Planning, 10Data Pipelines (Sprint 12), 10Patch-For-Review: Setup config to allow lineage instrumentation - https://phabricator.wikimedia.org/T333004 (10Antoine_Quhen) Some propositions for an immediate and more useful next step: 1/ Declare simple lineage between datasets in our Airflow d...
[16:03:39] (03PS1) 10Kimberly Sarabia: Add new fragment for editattemptstep [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/919157 (https://phabricator.wikimedia.org/T335309)
[16:03:43] /9
[16:03:45] err
[16:04:41] 10Quarry: Public viewing of superset - https://phabricator.wikimedia.org/T336522 (10rook)
[16:27:52] 10Data-Engineering, 10SRE, 10Trust-and-Safety, 10serviceops: Disable GeoIP Legacy Download / Identify all users of legacy (v1) GeoIP datasets and inform them of the need to switch to GeoIP2 dataset - https://phabricator.wikimedia.org/T303464 (10BCornwall)
[16:33:28] (SystemdUnitFailed) firing: (19) wmf_auto_restart_airflow-scheduler@analytics_product.service Failed on an-airflow1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:34:37] PROBLEM - Check systemd state on an-airflow1006 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_airflow-scheduler@analytics_product.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:46:50] (03CR) 10Ebernhardson: "> The replacement airflow instance for the search team is an-airflow1005" [analytics/refinery/scap] - 10https://gerrit.wikimedia.org/r/919036 (https://phabricator.wikimedia.org/T333697) (owner: 10Btullis)
[16:56:48] 10Data-Engineering-Planning, 10Traffic: varnishkafka / ATSkafka should support setting the kafka message timestamp - https://phabricator.wikimedia.org/T277553 (10BCornwall)
[17:27:30] (03CR) 10Urbanecm: [Growth] Personalized praise: Add database (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/917951 (https://phabricator.wikimedia.org/T325117) (owner: 10Urbanecm)
[18:27:34] (03CR) 10Jdlrobson: [C: 03+1] References new fragment in scroll [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/916625 (https://phabricator.wikimedia.org/T335309) (owner:
10Kimberly Sarabia)
[18:32:14] (03CR) 10Gmodena: mediawiki/page/change - Use single array field for user attributes (031 comment) [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/919106 (https://phabricator.wikimedia.org/T336506) (owner: 10Ottomata)
[18:46:51] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 12): eventutilities-python should support using Kafka TLS ports - https://phabricator.wikimedia.org/T331526 (10gmodena)
[18:58:17] 10Quarry: Assign multiple roles on superset login - https://phabricator.wikimedia.org/T336539 (10rook)
[19:29:51] 10Data-Engineering-Planning, 10Data Pipelines (Sprint 12), 10Patch-For-Review: Setup config to allow lineage instrumentation - https://phabricator.wikimedia.org/T333004 (10mforns) Oh, wow! 3/ looks magical. It has an on-premise version that we could install as a server. We could call it from Airflow from eit...
[19:46:28] (03PS1) 10Urbanecm: Add pageviews_token to analytics/mediawiki/mentor_dashboard/visit [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/919218 (https://phabricator.wikimedia.org/T325117)
[20:02:53] 10Data-Engineering: Codex, Graph, and Wikistats walk into a bar graph - https://phabricator.wikimedia.org/T336544 (10Milimetric) a:03Milimetric
[20:33:33] (SystemdUnitFailed) firing: (18) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:47:26] (03CR) 10Tsevener: [C: 03+2] Update mobile apps iOS schemas [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/919109 (https://phabricator.wikimedia.org/T328487) (owner: 10Mazevedo)
[20:47:58] (03Merged) 10jenkins-bot: Update mobile apps iOS schemas [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/919109 (https://phabricator.wikimedia.org/T328487) (owner: 10Mazevedo)
[21:12:07] (03PS1) 10Urbanecm: [WIP] Add
analytics/mediawiki/mentor_dashboard/interaction [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/919236 (https://phabricator.wikimedia.org/T325117)
[21:12:37] (03CR) 10CI reject: [V: 04-1] [WIP] Add analytics/mediawiki/mentor_dashboard/interaction [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/919236 (https://phabricator.wikimedia.org/T325117) (owner: 10Urbanecm)
[21:24:37] 10Data-Engineering-Planning, 10SRE-swift-storage, 10Event-Platform Value Stream (Sprint 12): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Eevans) a:03Eevans Ok, this is setup and has been tested. I created the t...
[22:23:49] (03CR) 10Jdlrobson: [C: 03+1] "Clare: can you merge this one if it looks okay to you?" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/916625 (https://phabricator.wikimedia.org/T335309) (owner: 10Kimberly Sarabia)
[23:03:11] (03CR) 10Clare Ming: [C: 03+2] "tested locally with updates to `webUIScroll.js` in WikimediaEvents pointing to this version number" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/916625 (https://phabricator.wikimedia.org/T335309) (owner: 10Kimberly Sarabia)
[23:03:49] (03Merged) 10jenkins-bot: References new fragment in scroll [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/916625 (https://phabricator.wikimedia.org/T335309) (owner: 10Kimberly Sarabia)