[07:58:28] 10Data-Platform-SRE: Investigate an-presto1002 failures - https://phabricator.wikimedia.org/T344808 (10JAllemandou) 05Resolved→03Open
[08:07:49] btullis: if you have time for a code review: https://gerrit.wikimedia.org/r/c/operations/puppet/+/950136
[08:08:00] I can give you more context about what we're trying to do if needed.
[08:28:53] gehel: looking now.
[08:30:47] 10Data-Platform-SRE, 10serviceops, 10Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (10Gehel) **DECISION** (as discussed in synchronous meeting): * Reading bulk data is done from the consumer (at t...
[08:35:11] (03CR) 10Phuedx: Add Metrics Platform fragments by platform only (035 comments) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: 10Clare Ming)
[09:51:38] 10Data-Platform-SRE, 10CAS-SSO, 10Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (10Stevemunene) >>! In T305874#9114087, @BTullis wrote: > In case it helps, I did a little digging into the CAS logs on idp-test1002 and stumbled upon this. > ` > ro...
[10:06:23] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 1), 10Event-Platform: Increase Max Message Size in Kafka Jumbo - https://phabricator.wikimedia.org/T344688 (10BTullis)
[10:38:49] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 1), 10Event-Platform, 10Patch-For-Review: Increase Max Message Size in Kafka Jumbo - https://phabricator.wikimedia.org/T344688 (10BTullis) I've created a patch to implement this and added some people as reviewers....
[11:07:27] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 1), 10Event-Platform, 10Patch-For-Review: Increase Max Message Size in Kafka Jumbo - https://phabricator.wikimedia.org/T344688 (10BTullis) Oof! According to this: https://www.conduktor.io/kafka/how-to-send-large-m...
[11:10:08] (03CR) 10Mforns: Add Metrics Platform fragments by platform only (032 comments) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: 10Clare Ming)
[11:13:00] PROBLEM - Webrequests Varnishkafka log producer on cp3074 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[11:13:24] ^looking
[11:14:47] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 1), 10Event-Platform, 10Patch-For-Review: Increase Max Message Size in Kafka Jumbo - https://phabricator.wikimedia.org/T344688 (10gmodena) >>! In T344688#9116488, @BTullis wrote: [...] > So I would be a little ret...
[11:24:44] !log btullis@cp3074:~$ sudo systemctl start varnishkafka-webrequest.service
[11:24:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:26:02] RECOVERY - Webrequests Varnishkafka log producer on cp3074 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[11:35:26] 10Data-Platform-SRE: Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0 - https://phabricator.wikimedia.org/T338057 (10BTullis) >>! In T338057#9114257, @xcollazo wrote: > Re 3.3 vs 3.4, I am yet to do any tests on 3.4. > > But actually, @BTullis , since I suspect that...
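For context on the Kafka Jumbo max-message-size discussion above (T344688, and the conduktor.io article linked at 11:07): raising Kafka's message size limit generally has to be done consistently at the broker, topic, and client level. A minimal sketch of the settings involved, using the stock Kafka CLI — the broker address, topic name, and the 10 MB value are placeholders, not taken from this log or from the actual patch:

```shell
# Hypothetical sketch: raising the max message size on a Kafka cluster.
# Host, topic, and size values are placeholders, not the T344688 settings.

# Broker-side (server.properties on each broker, rolling restart required):
#   message.max.bytes=10485760          # largest record batch a broker accepts
#   replica.fetch.max.bytes=10485760    # must be >= message.max.bytes, or
#                                       # replication of large batches stalls

# Per-topic override, applied dynamically:
kafka-configs.sh --bootstrap-server kafka-example:9092 \
  --alter --entity-type topics --entity-name example.topic \
  --add-config max.message.bytes=10485760
```

Producers then need `max.request.size` raised to match, and consumers `fetch.max.bytes` / `max.partition.fetch.bytes`; the interplay between these settings may be what the "Oof!" above refers to.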
[11:41:08] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10Stevemunene) `an-worker1117` is stuck at install with an error: no root filesystem is defined. Looking into this. {F37626140}
[11:57:06] 10Data-Platform-SRE: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10BTullis)
[11:59:22] 10Data-Platform-SRE: Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0 - https://phabricator.wikimedia.org/T338057 (10BTullis) 05Open→03Resolved a:03BTullis I created {T344910} to track the work on enabling multiple yarn/spark shuffler services.
[12:00:20] 10Data-Platform-SRE: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10BTullis) p:05Triage→03High Prioritizing this work in place of {T338057}.
[12:00:34] (03CR) 10Peter Fischer: cirrussearch: add fetch_failure schema (031 comment) [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/854572 (https://phabricator.wikimedia.org/T317609) (owner: 10DCausse)
[12:03:04] 10Data-Engineering, 10Product-Analytics, 10Patch-For-Review: Email notifications of new MediaWiki history snapshot availabilty - https://phabricator.wikimedia.org/T344854 (10CodeReviewBot) joal opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/481 Update mw_history_che...
[12:03:41] 10Data-Engineering, 10Product-Analytics, 10Patch-For-Review: Email notifications of new MediaWiki history snapshot availabilty - https://phabricator.wikimedia.org/T344854 (10JAllemandou) We could have done without the above PR, but there was a typo in the code which I corrected :)
[12:08:36] !log failing over hive to an-coord1002 in advance of reboot of an-coord1001
[12:08:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:30:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: gobblin-webrequest.service,refine_event.service,refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:31:15] ^expected
[12:40:03] !log rebooting an-coord1001
[12:40:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:53:42] (SystemdUnitFailed) firing: (2) gobblin-webrequest.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:58:47] (SystemdUnitFailed) firing: (3) gobblin-webrequest.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:06:12] 10Data-Engineering, 10Wikidata, 10Wikidata-Query-Service: Publish WDQS JNL files to dumps.wikimedia.org - https://phabricator.wikimedia.org/T344905 (10Gehel)
[13:20:36] 10Data-Platform-SRE: Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0 - https://phabricator.wikimedia.org/T338057 (10xcollazo) > Shall we decline this ticket and create a new one for enabling multiple shuffler services? I think this ticket is still relevant medium te...
[13:23:42] (SystemdUnitFailed) firing: (2) refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:26:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:28:42] (SystemdUnitFailed) resolved: (2) refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:00:05] btullis: is https://gerrit.wikimedia.org/r/c/operations/puppet/+/952215/ an attempt to fix the duplication issues in https://gerrit.wikimedia.org/r/c/operations/puppet/+/950136 ? Wouldn't it make more sense for you to send an additional PS on the initial CR?
[14:00:19] btw, thanks a lot for fixing my mess!
[16:04:45] gehel: Yes, I didn't know if it was possible to reuse the original CR, given that it had already been merged, then reverted. I just wanted to find a way to run pcc against the changes, since we had lost the original puppet reports. inflatador and stevemunene and I created that new CR whilst working together, but we've worked it out now in https://gerrit.wikimedia.org/r/c/operations/puppet/+/952221
[16:07:09] !log failing over hdfs namenode from an-master1001 to an-master1002
[16:07:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:08:02] https://www.irccloud.com/pastebin/Vy4VDv7Z/
[16:08:12] Phew! Better than last time.
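The namenode failover logged at 16:07 (and the later failback attempts) are typically driven with the standard Hadoop HA admin CLI. A sketch, assuming a stock two-namenode HA configuration — `nn1`/`nn2` are placeholder namenode service IDs, since the real IDs for the analytics-hadoop cluster are not shown in this log:

```shell
# Hypothetical sketch of a manual HDFS namenode failover with the stock
# Hadoop HA CLI. "nn1"/"nn2" are placeholder namenode IDs, not the real
# service IDs configured on analytics-hadoop.

# Check which namenode is currently active / standby:
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

# Fail over from nn1 to nn2 (graceful transition; falls back to the
# configured fencing method if the active namenode cannot be demoted):
hdfs haadmin -failover nn1 nn2
```

The YARN resourcemanager failover mentioned just below has an analogous CLI (`yarn rmadmin -getServiceState`, and `-transitionToActive`/`-transitionToStandby` when automatic failover is disabled).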
[16:09:11] !log failing over yarn resourcemanager to an-master1002
[16:09:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:09:41] https://www.irccloud.com/pastebin/17L4fylF/
[16:10:24] !log about to reboot an-master1001
[16:10:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:14:51] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks
[16:15:18] (03PS1) 10Phuedx: WIP: Add analytics/metrics_platform/{app,web}_click schemas [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833)
[16:15:46] (03CR) 10CI reject: [V: 04-1] WIP: Add analytics/metrics_platform/{app,web}_click schemas [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: 10Phuedx)
[16:29:37] https://yarn.wikimedia.org/ is not responding. Is this a side effect of the reboot of an-master1001.eqiad.wmnet ?
[16:29:47] Yes, precisely.
[16:29:52] Got it
[16:30:09] T331448
[16:30:10] T331448: Make YARN web interface work with both primary and standby resourcemanager - https://phabricator.wikimedia.org/T331448
[16:30:33] I will fail it back to the master in the next 10 minutes or so.
[16:30:48] ahem, primary
[16:39:47] xcollazo: I have failed back YARN to the primary. You should be good again for the YARN web ui.
[16:39:51] (HdfsCorruptBlocks) resolved: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks
[16:39:52] https://www.irccloud.com/pastebin/jmcafiuk/
[16:43:04] btullis: ty!
[16:43:25] You're welcome.
[16:43:45] !log going for failback of HDFS namenode service from an-master1002 to an-master1001
[16:43:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:45:09] Dammit, failed again.
[16:45:13] https://www.irccloud.com/pastebin/G242yWGA/
[16:46:49] !log failback unsuccessful. namenode services still running on an-master1002.
[16:46:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:47:14] !log start hadoop namenode on an-master1001 after crash.
[16:47:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:53:51] (HdfsFSImageAge) firing: The HDFS FSImage on analytics-hadoop:an-master1002:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[16:55:28] 10Data-Platform-SRE: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10BTullis) a:03BTullis
[16:58:51] (HdfsFSImageAge) firing: The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[17:23:51] (HdfsFSImageAge) firing: The HDFS FSImage on analytics-hadoop:an-master1002:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[17:28:07] 10Data-Platform-SRE, 10CAS-SSO, 10Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (10Stevemunene) Adding the `AUTH_OIDC_PREFERRED_JWS_ALGORITHM` worked and resolved the unsigned token error we had. Datahub can now receive the token from the idp...
[17:28:51] (HdfsFSImageAge) firing: (2) The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[17:58:51] (HdfsFSImageAge) resolved: The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[17:58:51] (HdfsFSImageAge) resolved: (2) The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[18:02:01] 10Data-Platform-SRE, 10Data-Services: Queries to externallinks table fail following schema changes - https://phabricator.wikimedia.org/T344866 (10bd808)
[18:20:08] !log attempting another failback of the hadoop namenode services
[18:20:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:20:31] https://www.irccloud.com/pastebin/PTt0nF04/
[18:20:38] Phew!
[18:21:37] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Discovery-Search (Current work), and 2 others: Rollout Elasticsearch extra plugins package and restart cluster to apply - https://phabricator.wikimedia.org/T344366 (10bking) Update: the reindex is taking longer than expe...
[18:33:19] 10Data-Platform-SRE: Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0 - https://phabricator.wikimedia.org/T338057 (10Gehel) 05Resolved→03Open Re-opening, Spark 3.x upgrade is still relevant in the medium term
[18:35:41] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:35:57] PROBLEM - Presto Server on an-presto1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[18:50:01] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:50:15] RECOVERY - Presto Server on an-presto1002 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[18:51:39] 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10Jclark-ctr) wdqs1017 D2. U38 wdqs1018 E2 U40 wdqs1019. F2. U39
[18:52:12] 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10Jclark-ctr)
[18:53:14] 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10Jclark-ctr) a:03Jclark-ctr
[19:01:09] 10Data-Platform-SRE: Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0 - https://phabricator.wikimedia.org/T338057 (10BTullis) >>! In T338057#9116936, @xcollazo wrote: >> The fact that we currently ship our yarn shuffler service jars with conda-analytics > Ah, but I t...
[21:10:14] (03PS2) 10Phuedx: Add analytics/metrics_platform/{app,web}_click schemas [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833)
[21:55:08] 10Data-Platform-SRE: Investigate wdqs1005 (hangs/crashes) - https://phabricator.wikimedia.org/T344960 (10bking)
[22:11:29] 10Data-Platform-SRE, 10CAS-SSO, 10Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (10BTullis) Great work on identifying and solving that blocker! I'm not sure that I quite understand yet why you have suggested the three scopes that you have in op...
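On the DataHub OIDC thread above (T305874): the fix noted at 17:28 — setting a preferred JWS algorithm — is one of several environment variables the DataHub frontend reads for OIDC login. A sketch with placeholder values; only the variable names are real DataHub settings, while the client ID, IdP URL, scope list, and choice of RS256 are assumptions, not taken from this log:

```shell
# Hypothetical datahub-frontend OIDC environment, sketched from the
# discussion above. All values are placeholders.
export AUTH_OIDC_ENABLED=true
export AUTH_OIDC_CLIENT_ID=datahub          # placeholder client ID
export AUTH_OIDC_CLIENT_SECRET=changeme     # placeholder secret
export AUTH_OIDC_DISCOVERY_URI=https://idp.example.org/oidc/.well-known/openid-configuration
export AUTH_OIDC_BASE_URL=https://datahub.example.org
export AUTH_OIDC_SCOPE="openid profile email"   # the scopes under discussion at 22:11
export AUTH_OIDC_PREFERRED_JWS_ALGORITHM=RS256  # the setting that resolved the
                                                # unsigned-token error; actual
                                                # algorithm not shown in the log
```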
[22:39:39] PROBLEM - Presto Server on an-presto1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[22:42:15] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:49:31] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:49:43] RECOVERY - Presto Server on an-presto1002 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[23:09:47] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:09:59] PROBLEM - Presto Server on an-presto1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[23:19:57] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:20:09] RECOVERY - Presto Server on an-presto1002 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down