[00:06:49] 10Quarry, 10superset.wmcloud.org, 10cloud-services-team (FY2022/2023-Q4): Move Quarry to be an installation of Superset - https://phabricator.wikimedia.org/T169452 (10EpicPupper) It’s bold to speak for all Quarry users with “we”. You still haven’t mentioned any actual features that Superset does jot have and... [00:19:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:21:44] (SystemdUnitFailed) firing: monitor_refine_eventlogging_analytics.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:32:08] 10Quarry, 10superset.wmcloud.org, 10cloud-services-team (FY2022/2023-Q4): Move Quarry to be an installation of Superset - https://phabricator.wikimedia.org/T169452 (10IKhitron) > It’s bold to speak for all Quarry users with “we”. Never said all of the Users. > You still haven’t mentioned any actual features... [04:21:44] (SystemdUnitFailed) firing: monitor_refine_eventlogging_analytics.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:19:38] 10Data-Engineering, 10AQS2.0, 10API Platform (AQS 2.0 Roadmap), 10Epic, and 2 others: AQS 2.0: Device Analytics service - https://phabricator.wikimedia.org/T288298 (10SGupta-WMF) [07:09:52] 10Data-Engineering, 10Infrastructure-Foundations: krb1001: krb5kdc.log excessive size - https://phabricator.wikimedia.org/T337906 (10elukey) @MoritzMuehlenhoff something seems not right on krb1001, the root partition is full but `ncdu` and similar tools don't show any particular culprit. 
I found this via `lsof... [07:16:44] (SystemdUnitFailed) firing: (2) monitor_refine_eventlogging_analytics.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:25:51] 10Data-Engineering, 10Infrastructure-Foundations: krb1001: krb5kdc.log excessive size - https://phabricator.wikimedia.org/T337906 (10MoritzMuehlenhoff) >>! In T337906#8901375, @elukey wrote: > @MoritzMuehlenhoff something seems not right on krb1001, the root partition is full but `ncdu` and similar tools don't... [08:04:51] (HdfsFSImageAge) firing: The HDFS FSImage on analytics-hadoop:an-master1002:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [08:09:47] 10Data-Engineering, 10Event-Platform Value Stream: Get coverage artifacts from Kokkuri - https://phabricator.wikimedia.org/T337400 (10tchin) [08:09:51] (HdfsFSImageAge) firing: (2) The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [08:12:29] 10Data-Engineering, 10Event-Platform Value Stream: Get coverage artifacts from Kokkuri - https://phabricator.wikimedia.org/T337400 (10tchin) Ok so just recounting my experiments: I used a build flag to copy the insides of the kokkuri container into the gitlab container so we can get the artifacts: `BUILDCTL_... 
[08:15:59] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 14 A): eventutilities-python: review and clean up in preparation for a GA release. - https://phabricator.wikimedia.org/T336488 (10gmodena) [08:24:51] (HdfsFSImageAge) resolved: (2) The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [08:36:57] 10Data-Engineering, 10Equity-Landscape: Update country_meta_data columns - https://phabricator.wikimedia.org/T338120 (10ntsako) a:05ntsako→03JAnstee_WMF [08:38:11] 10Data-Engineering, 10Equity-Landscape: Update country_meta_data columns - https://phabricator.wikimedia.org/T338120 (10ntsako) Hi @JAnstee_WMF, I have update ntsako.country_meta_data, please sign-off if everything is in order. [09:14:03] 10Data-Engineering, 10Data-Platform-SRE, 10Patch-For-Review, 10Shared-Data-Infrastructure (Q4 Wrap up): Bring stat1009 into service - https://phabricator.wikimedia.org/T336036 (10BTullis) I see that we are affected by the same issue as {T336281} in that two consecutive runs of puppet add and then remove hi... [09:39:45] 10Data-Engineering, 10Event-Platform Value Stream, 10Release-Engineering-Team: eventutillities-python should publish python doc to doc.wikimedia.org - https://phabricator.wikimedia.org/T337475 (10CodeReviewBot) gmodena updated https://gitlab.wikimedia.org/repos/data-engineering/eventutilities-python/-/merge_... [09:43:23] 10Data-Engineering, 10Product-Analytics: Remove home/HDFS leftovers of xihua - https://phabricator.wikimedia.org/T337711 (10BTullis) I stopped a running jupyterhub-singleuser service on stat1005 in order to allow puppet to run cleanly. ` btullis@stat1005:~$ sudo systemctl stop jupyter-xihua-singleuser-conda-an... 
[09:45:21] Data-Engineering, Data-Platform-SRE, Shared-Data-Infrastructure (Q4 Wrap up): Jupyterhub is broken in conda-analytics 0.0.13 - https://phabricator.wikimedia.org/T337471 (BTullis) Open→Resolved
[09:46:14] o/ is there some maintenance on hdfs/hadoop, seeing "Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error"?
[09:46:45] hi dcausse - I don't think we're under operation, but under failure :)
[09:47:26] ah sorry about that, good luck!
[09:50:24] dcausse: actually it seems hadoop fixed itself
[09:50:29] dcausse: I can read from HDFS
[09:50:35] Would you mind trying again?
[09:50:39] sure
[09:51:35] yes now I can see my error logs
[09:52:01] ok cool - HDFS had a hiccup, but I don't really know why :/
[09:52:02] Woah, I didn't know about this.
[09:52:25] I got this error running yarn logs --applicationId
[09:52:26] !log powered up an-worker1125
[09:52:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:54:48] actually it happened again just now...
[09:55:43] OK, the active namenode is currently running on the standby server, so it must have flipped at some point.
[09:55:49] https://www.irccloud.com/pastebin/5WtlqiJL/
[10:06:11] I'm seeing some more errors in the logs like this, but they're all coming from the standby server, an-master1001.
[10:06:14] 2023-06-05 10:04:09,748 WARN SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 10.64.53.29:37620:null (DIGEST-MD5: IO error acquiring password) with true cause: (Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error)
[10:07:44] dcausse: where are your requests originating? Is it possible that somehow an-master1001 could have been hardcoded as the active namenode, which wouldn't allow it to work properly in this failover state?
[10:08:55] I can initiate a failback from an-master1002 -> an-master1001 but it would be good to work out why these particular requests are failing.
[10:08:57] btullis: it was first running "yarn logs" from stat1004 and then from the job itself failing on node an-worker1103.eqiad.wmnet
[10:09:09] yarn logs -applicationId application_1678266962370_533672
[10:09:41] OK, thanks. Nothing particularly exotic then :(
[10:09:42] looking at my job args if they mention something about an-master1001
[10:09:54] Aha :)
[10:10:26] Oh, I misread, you said 'if they mention'
[10:11:56] hm.. I don't see anything hardcoded there (https://phabricator.wikimedia.org/P48706)
[10:12:41] I can try again if this helps?
[10:16:16] I can see a stream of `Auth failed` log messages from the command: `tail -f /var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-master1001.log|grep failed` on an-master1001 from lots of different hosts, so something isn't right.
[10:17:59] I'd also like to know when it failed over and whether we had any notification about it. Can you give me a few more minutes to look at it. If I can't find the answer soon, then I'll fail back to the primary namenode and it should be better.
[10:18:45] btullis: thanks! but nothing's urgent for me so please take your time :)
[10:41:53] Data-Engineering-Icebox, Data-Platform-SRE, Observability-Logging, Wikimedia-Logstash: Evaluate storing logs from applications in yarn with the typical logging infrastructure - https://phabricator.wikimedia.org/T300937 (BTullis)
[10:42:18] do we have a sense when these started to fail? we had krb1001 run out of disk space yesterday (I fixed it this morning) (not sure if an out-of-disk KDC might actually trigger this, just mentioning it just in case)
[10:43:24] moritzm: That's a good shout. I haven't yet found out when the failover from an-master1001 -> an-master1002 happened, but I will bear it in mind.
[10:45:07] At the moment I'm still trying to work out exactly what isn't working correctly under this failover situation. I think it's something to do with the yarn log aggregation step, once a job completes. [11:16:44] (SystemdUnitFailed) firing: monitor_refine_eventlogging_analytics.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:26:52] 10Data-Engineering, 10Data-Platform-SRE: spark3 in yarn master mode exhibits warnings when the HDFS namenodes are in the failed over state - https://phabricator.wikimedia.org/T338137 (10BTullis) [11:30:48] 10Data-Engineering, 10Data-Platform-SRE: spark3 in yarn master mode exhibits warnings when the HDFS namenodes are in the failed over state - https://phabricator.wikimedia.org/T338137 (10BTullis) Note that a `spark2-shell --master yarn` does not exhibit these warnings. ` btullis@stat1004:/etc/hadoop/conf$ spark... [11:31:16] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 14 A): mediawiki-page-content-change-enrichment checkpoints should be stored in Swift - https://phabricator.wikimedia.org/T336656 (10JArguello-WMF) 05Open→03Resolved [11:31:20] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 14 A): eventutilities-python manager should set up python logging with ECS format - https://phabricator.wikimedia.org/T335802 (10JArguello-WMF) 05Open→03Resolved [11:31:22] 10Data-Engineering-Planning, 10SRE-swift-storage, 10Event-Platform Value Stream (Sprint 14 A): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10JArguello-WMF) [11:31:24] I've created https://phabricator.wikimedia.org/T338137 to research the behaviour of the yarn logs and spark3 when in a failover condition. 
[11:31:25] 10Data-Engineering-Planning, 10serviceops-radar, 10Event-Platform Value Stream (Sprint 14 A): mediawiki-event-enrichment in k8s should use mwapi-async envoy listener for MW api requests - https://phabricator.wikimedia.org/T333575 (10JArguello-WMF) 05Open→03Resolved [11:31:27] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A): Define Service Level Objective (SLO) for mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T333833 (10JArguello-WMF) 05Open→03Resolved [11:31:29] 10Data-Engineering, 10serviceops, 10Event-Platform Value Stream (Sprint 14 A), 10Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (10JArguello-WMF) [11:31:33] 10Data-Engineering-Planning, 10Epic, 10Event-Platform Value Stream (Sprint 14 A), 10Patch-For-Review: Deploy mediawiki-page-content-change-enrichment to wikikube k8s - https://phabricator.wikimedia.org/T325303 (10JArguello-WMF) [11:31:37] 10Data-Engineering, 10serviceops, 10Event-Platform Value Stream (Sprint 14 A), 10Patch-For-Review: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (10JArguello-WMF) 05Open→03Resolved [11:31:43] 10Data-Engineering-Planning, 10Epic, 10Event-Platform Value Stream (Sprint 14 A), 10Patch-For-Review: Deploy mediawiki-page-content-change-enrichment to wikikube k8s - https://phabricator.wikimedia.org/T325303 (10JArguello-WMF) [11:31:47] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 14 A): eventutilities-python should support using Kafka TLS ports - https://phabricator.wikimedia.org/T331526 (10JArguello-WMF) 05Open→03Resolved [11:31:51] 10Data-Engineering, 10serviceops, 10Event-Platform Value Stream (Sprint 14 A), 10Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (10JArguello-WMF) 05Open→03Resolved [11:31:57] I'm about to fail back the 
DFS namenode to an-master1001 unless anyone objects in the next few minutes. [11:31:57] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A): Flink App Deployment monitoring - https://phabricator.wikimedia.org/T328925 (10JArguello-WMF) 05Open→03Resolved [11:32:01] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 12), 10Patch-For-Review: Refactor parameterization of eventutilities-python and mediawiki-event-enrichment - https://phabricator.wikimedia.org/T328478 (10JArguello-WMF) [11:32:05] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 08): Deploy mediawiki-event-enrichment flink app to DSE k8s - https://phabricator.wikimedia.org/T325305 (10JArguello-WMF) [11:32:13] 10Data-Engineering-Planning, 10Epic, 10Event-Platform Value Stream (Sprint 14 A), 10Patch-For-Review: Deploy mediawiki-page-content-change-enrichment to wikikube k8s - https://phabricator.wikimedia.org/T325303 (10JArguello-WMF) [11:32:17] 10Data-Engineering-Planning, 10Epic, 10Event-Platform Value Stream (Sprint 14 A), 10Patch-For-Review: Deploy mediawiki-page-content-change-enrichment to wikikube k8s - https://phabricator.wikimedia.org/T325303 (10JArguello-WMF) [11:32:21] 10Analytics, 10Data-Engineering, 10DBA, 10Event-Platform Value Stream, 10WMF-Architecture-Team: Consistent MediaWiki state change events | MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (10JArguello-WMF) [11:32:27] 10Data-Engineering, 10Metrics-Platform-Planning, 10Product-Analytics, 10WMF-Architecture-Team, and 3 others: Major (API) versioning of Event Platform streams - https://phabricator.wikimedia.org/T332212 (10JArguello-WMF) 05Open→03Resolved [11:43:11] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 B): jsonschema-tools tests - ensure that array items type is set - https://phabricator.wikimedia.org/T329515 (10JArguello-WMF) [11:43:14] !log sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet 
an-master1001-eqiad-wmnet [11:43:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:43:36] https://www.irccloud.com/pastebin/yXgI8wKU/ [11:43:44] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 14 B), 10Patch-For-Review: Release mediawiki.page_change.v1 stream - https://phabricator.wikimedia.org/T336817 (10JArguello-WMF) [11:43:53] 10Data-Engineering-Planning, 10Epic, 10Event-Platform Value Stream (Sprint 14 B), 10Patch-For-Review: Deploy mediawiki-page-content-change-enrichment to wikikube k8s - https://phabricator.wikimedia.org/T325303 (10JArguello-WMF) [11:43:58] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 B), 10Patch-For-Review: Improve mediawiki-event-enrichment test suite - https://phabricator.wikimedia.org/T328013 (10JArguello-WMF) [11:44:07] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 B): Improve Event Platform and MediaWiki Event Enrichment wikitech documentation - https://phabricator.wikimedia.org/T329629 (10JArguello-WMF) [11:44:11] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 14 B), 10Patch-For-Review: Enable HA failover for flink-kubernetes-operator - https://phabricator.wikimedia.org/T336185 (10JArguello-WMF) [11:44:19] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 14 B): eventutilities-python: review and clean up in preparation for a GA release. 
- https://phabricator.wikimedia.org/T336488 (10JArguello-WMF) [11:44:21] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 14 B), 10Patch-For-Review: Remove user is_registered field from mediawiki/page/change schema - https://phabricator.wikimedia.org/T337395 (10JArguello-WMF) [11:44:28] 10Data-Engineering, 10serviceops-radar, 10Event-Platform Value Stream (Sprint 14 B): Store Flink HA metadata in Zookeeper - https://phabricator.wikimedia.org/T331283 (10JArguello-WMF) [11:44:36] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 B): Event Catalog: Standardize Options Handling - https://phabricator.wikimedia.org/T333795 (10JArguello-WMF) [11:46:55] 10Data-Engineering, 10Data-Platform-SRE: spark3 in yarn master mode exhibits warnings when the HDFS namenodes are in the failed over state - https://phabricator.wikimedia.org/T338137 (10BTullis) I failed back to an-master1001 and it completed successfully: ` btullis@an-master1001:/var/log/hadoop-hdfs$ sudo -u... [11:48:51] dcausse: joal: I've verified that the warnings and errors have gone, now that I have failed back to an-master1001. I've no more time to investigate it today, but I have created a ticket so that we can come back to it. [11:54:09] 10Data-Engineering, 10Data-Platform-SRE, 10Observability-Alerting, 10Patch-For-Review: Reduce alert noise associated with individual users' jupyterhub-singleuser services - https://phabricator.wikimedia.org/T336951 (10BTullis) I have now deployed this change again. 
[12:19:44] (SystemdUnitCrashLoop) firing: crashloop on an-test-client1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [12:21:14] (03PS4) 10Nick Ifeajika: simplify totals query and write it to the same destination table as by_category [analytics/refinery] - 10https://gerrit.wikimedia.org/r/914799 (https://phabricator.wikimedia.org/T337059) [12:23:45] (03PS5) 10Nick Ifeajika: simplify totals query and write it to the same destination table as by_category [analytics/refinery] - 10https://gerrit.wikimedia.org/r/914799 (https://phabricator.wikimedia.org/T337059) [12:24:49] (03PS6) 10Nick Ifeajika: simplify totals query and write it to the same destination table as by_category [analytics/refinery] - 10https://gerrit.wikimedia.org/r/914799 (https://phabricator.wikimedia.org/T337059) [12:27:44] (03CR) 10Nick Ifeajika: simplify totals query and write it to the same destination table as by_category (033 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/914799 (https://phabricator.wikimedia.org/T337059) (owner: 10Nick Ifeajika) [12:28:08] btullis: thanks! [12:34:18] 10Data-Engineering-Planning, 10Data-Platform-SRE, 10Patch-For-Review, 10Shared-Data-Infrastructure (Q4 Wrap up): Upgrade the spark YARN shuffler service on Hadoop workers from version 2 to 3 - https://phabricator.wikimedia.org/T332765 (10BTullis) Moving this to waiting, whilst we plan the final deprecation... 
[12:34:39] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review, Shared-Data-Infrastructure (Q4 Wrap up): Upgrade the spark YARN shuffler service on Hadoop workers from version 2 to 3 - https://phabricator.wikimedia.org/T332765 (BTullis) p:Triage→Medium
[12:52:10] as far as I can see the hdfs failover happened for a problem with quorum in journal nodes:
[12:52:13] 2023-06-05 07:10:29,097 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.64.5.27:8485, 10.64.5.29:8485, 10.64.36.113:8485, 10.64.53.29:8485, 10.64.21.116:8485], stream=QuorumOutputStream starting at txid 10588592481))
[12:52:18] java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
[12:53:36] elukey: Many thanks, was this in the namenode log?
[12:54:43] btullis: yep exactly, /var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-master1001.log
[12:55:09] no idea though why it happened, judging from grafana metrics I don't see anything clear
[12:58:10] There's a little spike here, under journalnodes: https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&from=1685947871880&to=1685949798655 but seemingly nothing huge that would cause a 20 second timeout.
[13:07:09] Data-Engineering, Event-Platform Value Stream: mediawiki-event-enrichment and event enrichment job repo templating should bundle schema repos - https://phabricator.wikimedia.org/T335045 (Ottomata) This is done! https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/tree/main/ev...
[13:20:43] https://grafana.wikimedia.org/d/000000607/cluster-overview?from=1685947468682&to=1685951732178&var-server=an-master1001&var-datasource=thanos&var-cluster=analytics&orgId=1 https://usercontent.irccloud-cdn.com/file/qFeFMyqF/image.png
[13:21:12] Interesting sudden spike in processes - Not sure which host(s) yet.
[13:25:33] Seems like all workers, so a particularly heavy job launched at around 07:10?
[13:25:52] Data-Engineering, Event-Platform Value Stream: mediawiki-event-enrichment and event enrichment job repo templating should bundle schema repos - https://phabricator.wikimedia.org/T335045 (tchin) The cookiecutter template also does this via a post-generation hook https://gitlab.wikimedia.org/repos/data-eng...
[13:27:07] btullis: a lot of socket errors for one of the JN at around that time
[13:27:10] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=an-worker1078&var-datasource=thanos&var-cluster=analytics&from=1685945462759&to=1685951409028&viewPanel=20
[13:27:57] Yes, several of the journal nodes, I think. If not all of them.
[13:29:16] nothing weird in the logs though
[13:29:38] Data-Engineering, Event-Platform Value Stream: mediawiki-event-enrichment and event enrichment job repo templating should bundle schema repos - https://phabricator.wikimedia.org/T335045 (tchin) Is there a benefit to doing this in blubber though?
[13:36:44] (SystemdUnitFailed) firing: (2) monitor_refine_eventlogging_analytics.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:38:35] I'm wondering if it is related to this job, which was started at 06:57 but seems quite large: https://yarn.wikimedia.org/cluster/app/application_1678266962370_533091
[13:41:45] btullis: ok if I turn off dse-worker1002 to move its gpus to lift wing?
[13:42:00] elukey: Be my guest :-) [13:42:16] ack :) [13:42:16] 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure: Decommission an-test-coord1002 - https://phabricator.wikimedia.org/T336062 (10BTullis) [13:42:41] 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045 (10BTullis) [13:43:05] 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10BTullis) [13:43:23] 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (10BTullis) [13:43:46] 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10BTullis) [13:44:17] 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 (10BTullis) [13:44:46] 10Data-Engineering-Planning, 10Data-Platform-SRE, 10Shared-Data-Infrastructure: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (10BTullis) [13:45:17] 10Data-Engineering-Planning, 10Data-Platform-SRE, 10Shared-Data-Infrastructure: Refactor analytics-meta MariaDB layout to use an-mariadb100[12] - https://phabricator.wikimedia.org/T284150 (10BTullis) [13:45:46] 10Data-Engineering-Planning, 10Data-Catalog, 10Data-Platform-SRE, 10Shared-Data-Infrastructure: Review and improve the build process for DataHub containers - https://phabricator.wikimedia.org/T303381 (10BTullis) [13:46:30] 10Data-Engineering-Planning, 10Data-Platform-SRE, 10Shared-Data-Infrastructure: Upgrade the druid-analytics cluster to bullseye - https://phabricator.wikimedia.org/T332604 (10BTullis) [13:46:44] (SystemdUnitFailed) firing: (2) monitor_refine_eventlogging_analytics.service Failed on 
an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:53:51] 10Data-Engineering-Planning, 10Epic, 10Event-Platform Value Stream (Sprint 14 B), 10Patch-For-Review: Deploy mediawiki-page-content-change-enrichment to wikikube k8s - https://phabricator.wikimedia.org/T325303 (10CodeReviewBot) otto merged https://gitlab.wikimedia.org/repos/data-engineering/eventutilities-... [13:59:54] ottomata, btullis - when you have a moment could you review https://phabricator.wikimedia.org/T337825 and let me know if I am crazy or not? [14:01:52] 10Data-Engineering-Planning, 10Data-Platform-SRE, 10Shared-Data-Infrastructure (Q4 Wrap up): Rebuild hive-hcatalog package for bullseye to address missing symlinks - https://phabricator.wikimedia.org/T337465 (10BTullis) p:05Triage→03High a:03BTullis Expediting this, since it will: 1) block any further... [14:05:45] elukey: sounds good to me! thank yo u! [14:05:51] <3 [14:08:34] elukey: Yep, looks good. I'll give 924507 a good review over the next day or so, when I have more time. [14:09:19] One question on the systemd one, but feel free to ignore if it's inconvenient or I'm being stupid. [14:10:48] btullis: thanks! good point, I didn't know about targets, I added my ideas but if you know more please add suggestions in the code review :) [14:12:35] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A), 10Patch-For-Review: Flink App Deployment monitoring - https://phabricator.wikimedia.org/T328925 (10CodeReviewBot) otto opened https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/merge_requests/61 Bump to e... 
[14:12:45] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A), 10Patch-For-Review: Flink App Deployment monitoring - https://phabricator.wikimedia.org/T328925 (10CodeReviewBot) [14:21:44] (SystemdUnitFailed) firing: (2) monitor_refine_eventlogging_analytics.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:38:24] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A), 10Patch-For-Review: Flink App Deployment monitoring - https://phabricator.wikimedia.org/T328925 (10CodeReviewBot) otto merged https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/merge_requests/61 Bump to e... [14:38:26] 10Data-Engineering, 10Event-Platform Value Stream: Fix eventutillites_python stream_manager error_sink configuration - https://phabricator.wikimedia.org/T335591 (10CodeReviewBot) otto merged https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/merge_requests/61 Bump to eventutiliti... [14:59:16] (03PS8) 10Peter Fischer: Encode redirect targets in page change events. 
[schemas/event/primary] - https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315)
[15:08:21] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: Improve mediawiki-event-enrichment test suite - https://phabricator.wikimedia.org/T328013 (JArguello-WMF)
[15:13:31] Data-Engineering, Event-Platform Value Stream: mediawiki-event-enrichment and event enrichment job repo templating should bundle schema repos - https://phabricator.wikimedia.org/T335045 (JArguello-WMF) Open→Resolved
[15:14:56] Data-Engineering, Equity-Landscape: Update country_meta_data columns - https://phabricator.wikimedia.org/T338120 (JAnstee_WMF) @ntsako I missed a final comment on our column naming input from Product Analytics ([[ https://phabricator.wikimedia.org/T318850 | T318850: Provide recommendations for Regional...
[15:15:14] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): jsonschema-tools deterministic schema test should fail if a schema uses oneOf with different types - https://phabricator.wikimedia.org/T337855 (JArguello-WMF)
[15:15:33] Data-Engineering, Equity-Landscape, Movement-Insights: Update country_meta_data columns - https://phabricator.wikimedia.org/T338120 (JAnstee_WMF) p:Triage→High a:JAnstee_WMF→ntsako
[15:15:58] Data-Engineering, Release-Engineering-Team, Event-Platform Value Stream (Sprint 14 B): eventutillities-python should publish python doc to doc.wikimedia.org - https://phabricator.wikimedia.org/T337475 (JArguello-WMF)
[15:16:39] Data-Engineering, Event-Platform Value Stream: Move eventutiltities-python repo into main wikimedia-eventutilities repository - https://phabricator.wikimedia.org/T337491 (JArguello-WMF) p:Triage→Medium
[15:16:48] Data-Engineering, Event-Platform Value Stream: Move wikimedia-event-utilities to gitlab - https://phabricator.wikimedia.org/T337477 (JArguello-WMF) p:Triage→Medium
[15:18:06] Data-Engineering, Data Pipelines, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: Fix wikimedia-event-utilities Guava dependencies issues - https://phabricator.wikimedia.org/T337421 (JArguello-WMF) p:Triage→High
[15:18:37] Data-Engineering, Data Pipelines, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: Fix wikimedia-event-utilities Guava dependencies issues - https://phabricator.wikimedia.org/T337421 (JArguello-WMF)
[15:18:57] Data-Engineering, Event-Platform Value Stream: Get coverage artifacts from Kokkuri - https://phabricator.wikimedia.org/T337400 (JArguello-WMF) p:Triage→Medium a:gmodena→tchin
[15:19:06] !log deployed airflow analytics to fix edit_hourly DAG
[15:19:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:19:10] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): Get coverage artifacts from Kokkuri - https://phabricator.wikimedia.org/T337400 (JArguello-WMF)
[15:20:05] Data-Engineering, Event-Platform Value Stream: Update eventgate helm chart to use automatic kafka egress networkpolicies - https://phabricator.wikimedia.org/T335024 (Ottomata) a:Ottomata
[15:20:25] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): Update eventgate helm chart to use automatic kafka egress networkpolicies - https://phabricator.wikimedia.org/T335024 (JArguello-WMF)
[15:22:43] (PS1) Amire80: Add Arabic and Urdu [analytics/wikistats2] - https://gerrit.wikimedia.org/r/927231
[15:28:14] (PS7) Nick Ifeajika: simplify totals query and write it to the same destination table as by_category [analytics/refinery] - https://gerrit.wikimedia.org/r/914799 (https://phabricator.wikimedia.org/T337059)
[15:50:13] Data-Engineering, Product-Analytics (Kanban): Product Analytics ETL Migration: Pilot (MediaSearch ETLs) - https://phabricator.wikimedia.org/T333208 (JArguello-WMF)
[15:50:17] Data-Engineering-Planning, Data Pipelines, Epic: Support for Product Analytics Data Pipelines Migration to Airflow - https://phabricator.wikimedia.org/T332997 (JArguello-WMF)
[16:00:10] Data-Engineering-Planning, Data Pipelines (Sprint 14), Patch-For-Review: Add Python Linter Checks to CI - https://phabricator.wikimedia.org/T318346 (CodeReviewBot) aqu opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/417 Gitlab linting in CI
[16:00:21] Data-Engineering-Planning, Data Pipelines (Sprint 14), Patch-For-Review: Add Python Linter Checks to CI - https://phabricator.wikimedia.org/T318346 (CodeReviewBot)
[16:19:44] (SystemdUnitCrashLoop) firing: crashloop on an-test-client1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[16:20:36] !log pooling service=wikireplicas-a,name=dbproxy1019.eqiad.wmnet to allow us to depool the analytics wikireplica servers
[16:20:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:21:20] !log depooling service=wikireplicas-a,name=dbproxy1018.eqiad.wmnet
[16:21:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:25:45] Data-Engineering, Equity-Landscape, Movement-Insights: Update country_meta_data columns - https://phabricator.wikimedia.org/T338120 (JAnstee_WMF) @ntsako Signing off on the change implemented!
[16:28:53] (CR) Ottomata: Encode redirect targets in page change events. (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: Peter Fischer)
[16:49:53] Data-Engineering-Planning, Event-Platform Value Stream, Epic: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 (Ottomata)
[16:55:55] Data-Engineering, Data Pipelines (Sprint 14), Google-Chrome-User-Agent-Deprecation, Product-Analytics (Kanban): Model impact of User-Agent deprecation on top line metrics - https://phabricator.wikimedia.org/T336084 (JArguello-WMF)
[16:57:53] Data-Engineering, Event-Platform Value Stream: mw-page-content-change-enrich should ensure ordering by wiki_id,page_id, and (re)produce kafka keys - https://phabricator.wikimedia.org/T338169 (Ottomata)
[16:57:58] Data-Engineering, Data-Engineering-Wikistats, Data Pipelines (Sprint 14): Wikistats Bug: Small countries not displayed on the map - https://phabricator.wikimedia.org/T338033 (JArguello-WMF) p:Triage→High
[17:31:50] Data-Engineering, DBA, Data-Services, TaxonBot, and 3 others: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (SD0001) s1 looks to be down again. ` sd@tools-sgebastion-10:~$ sql enwiki ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial commun...
[17:36:57] Data-Engineering-Planning, Event-Platform Value Stream: [Shared Event Platform] Implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (Ottomata)
[17:39:58] Data-Engineering-Planning, Event-Platform Value Stream: [Shared Event Platform] Implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (Ottomata) Actually, no we should not close this. > Eventual consistency issues are understood and mi...
[17:41:20] Data-Engineering-Planning, Event-Platform Value Stream: [Shared Event Platform] Implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (Ottomata) > retry events that are considered recent, in our case a good value for recent is 10s (c.f. T2...
[17:41:40] Data-Engineering-Planning, Event-Platform Value Stream: [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (Ottomata)
[17:42:25] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B): [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (Ottomata)
[17:47:14] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B): [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (Ottomata) @gmodena what should we do with the page_content_change event...
[17:50:55] Data-Engineering, DBA, Data-Services, TaxonBot, and 3 others: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (Ladsgroup) DB-wise things are good: {F37094416} I think something is broken on network side of things. Please file a separate ticket.
[17:51:44] (SystemdUnitFailed) firing: (3) monitor_refine_eventlogging_analytics.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:16:44] (SystemdUnitFailed) firing: (3) monitor_refine_eventlogging_analytics.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:20:02] !log restarted haproxy service on dbproxy1018 for T338172
[18:20:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:20:04] T338172: Can't connect to analytics replicas from Toolforge - https://phabricator.wikimedia.org/T338172
[18:21:44] (SystemdUnitFailed) firing: (3) monitor_refine_eventlogging_analytics.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:32:23] Data-Engineering-Planning, Data Pipelines (Sprint 14), Patch-For-Review: Add Python Linter Checks to CI - https://phabricator.wikimedia.org/T318346 (CodeReviewBot) aqu opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/418 Format the code following implementatio...
[18:36:44] (SystemdUnitFailed) firing: (3) monitor_refine_eventlogging_analytics.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:56:17] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: Release mediawiki.page_change.v1 stream - https://phabricator.wikimedia.org/T336817 (Ottomata) I tried deploying 'cleanup by removing wgEventBusStreamNamesMap override in mediawiki-config' again today, and once again broke...
[19:22:19] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 12): Event Driven Enrichment Pipelines repositories should be generated from a template - https://phabricator.wikimedia.org/T324980 (tchin) Open→Resolved
[20:19:44] (SystemdUnitCrashLoop) firing: crashloop on an-test-client1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[20:53:37] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): eventutilities-python: review and clean up in preparation for a GA release. - https://phabricator.wikimedia.org/T336488 (gmodena)
[21:03:13] Analytics-Radar: Remove old origin-when-crossorigin Safari misspelling of referrer policy - https://phabricator.wikimedia.org/T338183 (TheDJ)
[21:07:02] Analytics-Radar, Patch-For-Review: Remove old origin-when-crossorigin Safari misspelling of referrer policy - https://phabricator.wikimedia.org/T338183 (TheDJ) Tagging @Nuria and @Krinkle
[21:44:11] Data-Engineering, Movement-Insights, Product-Analytics, Research-Backlog: Investigate relation of UA deprecation to increase in automated traffic and reduction in unique devices - https://phabricator.wikimedia.org/T336715 (Mayakp.wiki)
[21:55:15] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): Get coverage artifacts from Kokkuri - https://phabricator.wikimedia.org/T337400 (dancy) @tchin Just wanted to acknowledge receipt of your message though I wasn't able to focus on it today.
[22:29:19] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): eventutilities-python: review and clean up in preparation for a GA release. - https://phabricator.wikimedia.org/T336488 (CodeReviewBot) otto merged https://gitlab.wikimedia.org/repos/data-engineering/eventutilities-python/-/merge_requests/62 G...
[22:29:28] Data-Engineering, Release-Engineering-Team, Event-Platform Value Stream (Sprint 14 B): eventutillities-python should publish python doc to doc.wikimedia.org - https://phabricator.wikimedia.org/T337475 (CodeReviewBot) otto merged https://gitlab.wikimedia.org/repos/data-engineering/eventutilities-pytho...
[22:36:44] (SystemdUnitFailed) firing: (2) monitor_refine_eventlogging_analytics.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:06:21] Data-Engineering, Movement-Insights, Product-Analytics, Research-Backlog: Investigate relation of UA deprecation to increase in automated traffic and reduction in unique devices - https://phabricator.wikimedia.org/T336715 (Mayakp.wiki)
[23:33:29] Data-Engineering, Data Pipelines (Sprint 14), Google-Chrome-User-Agent-Deprecation, Product-Analytics (Kanban): Model impact of User-Agent deprecation on top line metrics - https://phabricator.wikimedia.org/T336084 (Mayakp.wiki) @kostajh : Thank you for clarifying when the hints would be logged....