[00:53:57] (SystemdUnitFailed) firing: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:00:42] 10Data-Engineering: Add analytics-platform-eng-admins on stat* hosts - https://phabricator.wikimedia.org/T333264 (10Htriedman) [03:55:34] 10Data-Engineering, 10Data-Persistence, 10IP Masking: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (10Niharika) I'll note that from a product perspective even though we would *like* to not change the prefix pattern on the temp user, it still is a config variable that is s... [04:52:22] 10Data-Engineering, 10Data-Persistence, 10IP Masking: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (10Marostegui) >>! In T333223#8731730, @Ladsgroup wrote: > Yup, adding a new column is not that hard. There are some documentation on this https://wikitech.wikimedia.org/wik... 
[04:53:53] (SystemdUnitFailed) firing: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:47:54] 10Data-Engineering-Planning, 10DBA, 10Data Pipelines, 10Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10Marostegui) [08:11:46] 10Data-Engineering-Planning, 10Patch-For-Review, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10Stevemunene) [08:12:28] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure, 10Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10Stevemunene) [08:38:05] (03CR) 10Btullis: "recheck" [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/903262 (https://phabricator.wikimedia.org/T303381) (owner: 10Btullis) [08:53:58] (SystemdUnitFailed) firing: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:58:07] 10Analytics-Radar, 10Infrastructure-Foundations, 10netops: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10aborrero) 05Resolved→03Open This happened to me today in a couple of hardware servers, see {T333281} and {T333282}. 
[09:07:13] (03PS2) 10Btullis: Tweak the build process and and fix local container builds [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/903262 (https://phabricator.wikimedia.org/T303381) [09:09:01] 10Analytics-Radar, 10Infrastructure-Foundations, 10netops: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10cmooney) @aborrero do you have more details on what happened with those? I'm not sure the symptoms are the same. In the Ganeti case the hyperviso... [09:11:43] 10Analytics-Radar, 10Infrastructure-Foundations, 10netops: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10aborrero) >>! In T273026#8732900, @cmooney wrote: > @aborrero do you have more details on what happened with those? > > I'm not sure the symptoms... [09:17:03] (03PS10) 10Jennifer Ebe: T305842-Migrate-The-Referrer-Job-Daily-hql [analytics/refinery] - 10https://gerrit.wikimedia.org/r/902068 [09:22:32] (03CR) 10Joal: "LGTM! Thanks Jennifer :)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/902068 (owner: 10Jennifer Ebe) [09:22:37] (03CR) 10Joal: [C: 03+1] T305842-Migrate-The-Referrer-Job-Daily-hql [analytics/refinery] - 10https://gerrit.wikimedia.org/r/902068 (owner: 10Jennifer Ebe) [09:24:11] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for next deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/902068 (owner: 10Jennifer Ebe) [09:31:50] 10Data-Engineering-Planning, 10serviceops, 10Event-Platform Value Stream (Sprint 10), 10Patch-For-Review, 10Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (10Joe) Sorry, I'm getting confused; to my understanding, WDQS... [09:37:41] * btullis I have created a patch to disable ingestion to HDFS: https://gerrit.wikimedia.org/r/c/operations/puppet/+/903610 - To be deployed at around 12:50 UTC today. 
[09:46:06] 10Data-Engineering-Planning, 10serviceops, 10Event-Platform Value Stream (Sprint 10), 10Patch-For-Review, 10Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (10dcausse) >>! In T330507#8732991, @Joe wrote: > Sorry, I'm ge... [09:51:52] 10Data-Engineering-Planning, 10DBA, 10Data Pipelines, 10Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10BTullis) [09:55:36] (03CR) 10Btullis: [C: 03+2] Tweak the build process and and fix local container builds [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/903262 (https://phabricator.wikimedia.org/T303381) (owner: 10Btullis) [09:56:25] 10Data-Engineering-Planning, 10Patch-For-Review, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by stevemunene@cumin1001 for host an-test-c... [09:56:56] (SystemdUnitFailed) firing: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:56:56] 10Data-Engineering-Planning, 10Patch-For-Review, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10Stevemunene) Ozzie packages which are now deemed deprecated are omitted from bullseye going forward per T333295.... [10:01:24] 10Data-Engineering, 10Data-Persistence, 10IP Masking: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (10Ladsgroup) oh thanks. 
That means it'll take a month or two at most (after schema change patch getting merged in core) [10:20:02] (03Merged) 10jenkins-bot: Tweak the build process and and fix local container builds [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/903262 (https://phabricator.wikimedia.org/T303381) (owner: 10Btullis) [10:32:34] 10Analytics-Radar, 10Infrastructure-Foundations, 10netops: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10cmooney) >>! In T273026#8732916, @aborrero wrote: > I don't know exactly what happened. > > My hunch is that the systemd service has been in faile... [10:46:38] !log failing over hive services to an-coord1002 prior to switch upgrade. [10:46:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:01:37] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Research and test methods for accessing kerberized services from spark running on the DSE K8S cluster - https://phabricator.wikimedia.org/T330162 (10BTullis) Marking this ticket as Done. We have a draft document: [[https://docs.google.com/document/d/1... [11:23:34] (SystemdUnitFailed) resolved: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:23:52] elukey: o/ I wonder if I could possibly get your opinion on https://gerrit.wikimedia.org/r/c/operations/puppet/+/903627 as I've never done this before. 
[11:23:54] (SystemdUnitFailed) firing: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:28:08] 10Data-Engineering-Planning, 10DBA, 10Data Pipelines, 10Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10BTullis) [11:30:30] 10Data-Engineering-Planning, 10DBA, 10Data Pipelines, 10Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10fnegri) I "depooled" dbproxy1019 by following the procedure at https://wikitech.wikimedia.org/w/index.php?title=Portal:Data_Services... [11:34:12] Hi all, I tried querying wikishared via superset's SQL lab, and I keep getting an unknown error. Am I doing something wrong? [11:34:16] This is my query https://usercontent.irccloud-cdn.com/file/yiHsOYtK/image.png [11:40:34] urbanecm: No, you're not. It's a missing permission which started hitting people after the recent upgrade to version 1.5.3 - https://phabricator.wikimedia.org/T328457 I can add the sql_lab permission to your account. Hang on a sec. [11:42:01] urbanecm: Can you try again now please? I've added you manually, but we're discussing how best to do this automatically. [11:43:33] It works now, thanks btullis. Would you mind adding https://ldap.toolforge.org/user/etonkovidova too (QA from Growth, who needs to see those tables too), please? [11:44:25] btullis: o/ it has been a while but I think it should do it! 
[11:44:50] Yarn will need a restart, and after that you should see the UI updated IIRC [11:44:58] lemme know if I can help :) [11:46:40] elukey: Great, yeah I found it from the old IRC logs: https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-analytics/20210525.txt - I'm going to add the howto to wikitech. :-) [11:46:58] I also added some info on safe mode: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Hadoop/Administration#Safe_Mode [11:47:25] wow throwback tuesday :D [11:47:27] urbanecm: Great, will do. It might be later this afternoon. [11:47:34] ty [11:48:21] elukey: So it should be a `yarn rmadmin -refreshQueues` (plus the appropriate `kerberos-run-command`) after deploy, right? Found it here: https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.5/bk_yarn-resource-management/content/starting_and_stopping_queues.html [11:48:43] yes yes now I recall, in theory it should be it [11:49:08] Great, thanks. I'll give it a crack and only bother you if I get in a tizz :-) [11:49:15] please :) [11:53:05] btullis: Heya - I'm around if you need an eye during the operation [12:05:11] joal: oh yes please. Much appreciated. [12:07:50] 10Data-Engineering: Add analytics-platform-eng-admins on stat* hosts - https://phabricator.wikimedia.org/T333264 (10Ottomata) @Htriedman can you describe what you are trying to do? You mentioned you want to use /srv/published (from stat machines), but want to use the analytics-platform-eng system user, which is... [12:11:54] urbanecm: I've added the sql_lab role to etonkovidova as requested. [12:12:21] ty btullis! [12:19:11] 10Data-Engineering-Planning, 10Data Pipelines: Review Superset permissions and assign roles as appropriate - https://phabricator.wikimedia.org/T328457 (10BTullis) > let's just default granting sql_lab access to all Superset accounts. I agree in principle and I'll start looking at how best to configure this. H... 
[12:27:19] 10Data-Engineering-Planning, 10serviceops, 10Event-Platform Value Stream (Sprint 10), 10Patch-For-Review, 10Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (10Ottomata) Oh, I misunderstood, I thought that WDQS updater w... [12:29:10] 10Data-Engineering-Planning, 10DBA, 10Data Pipelines, 10Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ayounsi) [12:29:37] 10Data-Engineering-Planning, 10Data Pipelines: Review Superset permissions and assign roles as appropriate - https://phabricator.wikimedia.org/T328457 (10BTullis) I have added the `sql_lab` role to @JTannerWMF (`jaz`), @Aline_Bruenger_WMDE (`alinebruenger`), @Siko_WMDE (`siko`), @Jgiannelos (`jgiannelos`), @Ur... [12:34:25] 10Data-Engineering-Planning, 10Data Pipelines: Review Superset permissions and assign roles as appropriate - https://phabricator.wikimedia.org/T328457 (10Jgiannelos) Thanks @BTullis i just verified i now can run queries in superset. [12:37:29] 10Data-Engineering-Planning, 10serviceops, 10Event-Platform Value Stream (Sprint 10), 10Patch-For-Review, 10Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (10dcausse) >>! In T330507#8734369, @Ottomata wrote: > Oh, I mi... 
[12:46:36] 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Machine-Learning-Team, 10Observability-Logging, and 2 others: Evaluate Benthos as stream processor - https://phabricator.wikimedia.org/T319214 (10elukey) 05Open→03Resolved a:03elukey Closing since we have been using benthos for a while :) [12:50:38] !log merging the change to disable ingestion to HDFS https://gerrit.wikimedia.org/r/c/operations/puppet/+/903610 [12:50:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:50:41] 10Data-Engineering-Planning, 10DBA, 10Data Pipelines, 10Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ssingh) [12:58:25] 10Data-Engineering-Planning, 10DBA, 10Data Pipelines, 10Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eq... [13:02:35] 10Data-Engineering-Planning, 10DBA, 10Data Pipelines, 10Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10BTullis) [13:17:56] 10Data-Engineering-Planning, 10DBA, 10Data Pipelines, 10Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eq... 
[13:22:50] 10Data-Engineering-Planning, 10DBA, 10Data Pipelines, 10Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ssingh) [13:29:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [13:29:54] joal: I'm about ready to stop the YARN queues. We still have a little bit of writing going on, but I've warned as many people as possible to stop their jobs if possible. [13:31:15] !log setting all four YARN queues to STOPPED https://gerrit.wikimedia.org/r/c/operations/puppet/+/903627 T330165 [13:31:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:31:19] T330165: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 [13:35:51] 10Data-Engineering-Planning, 10DBA, 10Data Pipelines, 10Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10BTullis) [13:37:03] !log refreshed YARN queues with: `sudo kerberos-run-command yarn /usr/bin/yarn rmadmin -refreshQueues` on both an-master100[1-2] - T330165 [13:37:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:37:07] T330165: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 [13:41:05] 10Data-Engineering-Planning, 10DBA, 10Data Pipelines, 10Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10Jelto) [13:44:18] 10Data-Engineering-Planning, 10DBA, 10Data Pipelines, 10Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10herron) [13:45:46] btullis: 
all good from your side? [13:46:07] 10Data-Engineering-Planning, 10DBA, 10Data Pipelines, 10Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10Jelto) [13:48:31] (SystemdUnitFailed) firing: (8) refine_netflow.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:49:41] 10Data-Engineering-Planning, 10DBA, 10Data Pipelines, 10Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=4c1e12e1-9d5e-4447-880a-f0ec09133a64) set by ayounsi@cumin1001 for... [13:49:56] joal: We still have more writes going to HDFS than I would like. I'm going to have to put it into read-only mode very soon. [13:51:36] btullis: there is only one prod job running [13:52:35] btullis: last prod job done! [13:52:36] joal: Yes. All the rest are user jobs and they have been adequately warned. [13:52:47] Great! [13:53:02] Happy for me to go ahead with safe mode? 
[13:54:03] 10Data-Engineering-Planning, 10DBA, 10Data Pipelines, 10Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10jbond) [13:54:10] please do btullis [13:54:52] !log entering safe mode for analytics-hadoop cluster: T330165 [13:54:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:55:00] T330165: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 [13:55:10] https://www.irccloud.com/pastebin/AiyJzotA/ [13:55:43] 10Data-Engineering-Planning, 10DBA, 10Data Pipelines, 10Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10MatthewVernon) [13:56:31] 10Data-Engineering-Planning, 10DBA, 10Data Pipelines, 10Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10BTullis) [13:59:31] 10Data-Engineering-Planning, 10DBA, 10Data Pipelines, 10Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10fgiunchedi) [14:01:59] 10Data-Engineering-Planning, 10DBA, 10Data Pipelines, 10Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10jbond) [14:08:31] (SystemdUnitFailed) firing: (6) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:12:26] PROBLEM - MariaDB Replica IO: analytics-meta-replica on an-coord1002 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@an-coord1001.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on an-coord1001.eqiad.wmnet (110 Connection 
timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:12:56] ACKNOWLEDGEMENT - MariaDB Replica IO: analytics-meta-replica on an-coord1002 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@an-coord1001.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on an-coord1001.eqiad.wmnet (110 Connection timed out) Btullis T330165 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a [14:12:56] a [14:14:01] (SystemdUnitFailed) firing: (6) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:17:12] PROBLEM - Host analytics1069 is DOWN: PING CRITICAL - Packet loss = 100% [14:18:40] RECOVERY - MariaDB Replica IO: analytics-meta-replica on an-coord1002 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:18:55] (GobblinLastSuccessfulRunTooLongAgo) firing: Last successful gobblin run of job event_default was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://grafana.wikimedia.org/d/pAQaJwEnk/gobblin?orgId=1&var-gobblin_job_name=event_default - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [14:24:36] (GobblinLastSuccessfulRunTooLongAgo) firing: Last successful gobblin run of job eventlogging_legacy was more than 2 hours ago. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://grafana.wikimedia.org/d/pAQaJwEnk/gobblin?orgId=1&var-gobblin_job_name=eventlogging_legacy - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [14:25:12] !log restarting hive-server2 and hive-metastore services on an-coord1001 [14:25:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:25:49] !log proceeding to take HDFS out of safe mode. [14:25:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:26:53] 10Data-Engineering-Planning, 10DBA, 10Data Pipelines, 10Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10fgiunchedi) [14:30:10] HDFS is back out of safe mode and blocks are being written again. [14:30:12] https://usercontent.irccloud-cdn.com/file/0Hh6I5jS/image.png [14:31:16] !log re-enabling YARN queues: https://gerrit.wikimedia.org/r/c/operations/puppet/+/903565 T330165 [14:31:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:31:20] T330165: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 [14:32:48] 10Data-Engineering-Planning, 10DBA, 10Data Pipelines, 10Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: eqia... 
[14:35:03] !log re-enabling gobblin timers: https://gerrit.wikimedia.org/r/c/operations/puppet/+/903668 T330165 [14:35:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:35:33] 10Data-Engineering, 10Advanced-Search, 10All-and-every-Wikisource, 10ArticlePlaceholder, and 69 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (10kostajh) [14:40:07] feels like a successful operation btullis - right? [14:41:09] joal: Pretty good so far. I've got timeouts running puppet agent on an-master nodes, which means that the YARN queues are still disabled. [14:41:29] ack [14:41:35] I've temporarily disabled puppet on an-launcher1002 so that it doesn't re-enable gobblin until this is done. [14:46:04] Seems widespread. [14:47:59] 10Data-Engineering-Planning, 10DBA, 10Data Pipelines, 10Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: eqia... [14:48:55] (GobblinLastSuccessfulRunTooLongAgo) firing: Last successful gobblin run of job netflow was more than 2 hours ago. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://grafana.wikimedia.org/d/pAQaJwEnk/gobblin?orgId=1&var-gobblin_job_name=netflow - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [14:50:12] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ayounsi) [14:50:26] 10Data-Engineering-Planning, 10DBA, 10Data Pipelines, 10Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ayounsi) [14:53:31] (SystemdUnitFailed) firing: (7) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:54:46] https://www.irccloud.com/pastebin/YHAqj1eB/ [14:58:48] (GobblinLastSuccessfulRunTooLongAgo) firing: (2) Last successful gobblin run of job netflow was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [14:59:12] 10Data-Engineering-Planning, 10serviceops, 10Event-Platform Value Stream (Sprint 10), 10Patch-For-Review, 10Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (10Ottomata) @Joe we discussed the use of page_content_change i... [14:59:55] 10Data-Engineering-Planning, 10DBA, 10Data Pipelines, 10Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ayounsi) The switch upgrade itself went smoothly as well, like the other rows. One issue was that gerrit1001 was missing from the l... [15:02:33] OK, puppet situation resolved. 
YARN queues RUNNING again. Now running puppet on an-launcher1002 to re-enable the gobblin jobs. [15:02:53] \o/ [15:08:06] 10Data-Engineering-Planning, 10Patch-For-Review, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by stevemunene@cumin1001 for host an-test-clien... [15:13:48] (GobblinLastSuccessfulRunTooLongAgo) firing: (2) Last successful gobblin run of job netflow was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [15:19:23] 10Quarry: [bug] "Internal Server Error" when logging into Quarry - https://phabricator.wikimedia.org/T333043 (10MarcoAurelio) Trying to log-in to Quarry today via https://quarry.wmcloud.org/oauth-callback?oauth_verifier=[redacted]&oauth_token=[redacted] failed with: ` Internal Server Error The server encounter... [15:21:15] I'm about to kick off a refinery deploy shortly. I see that there are three things in the etherpad, all of which are already merged: https://etherpad.wikimedia.org/p/analytics-weekly-train [15:21:24] Does anyone have anything else to be deployed today? [15:28:48] (GobblinLastSuccessfulRunTooLongAgo) firing: (3) Last successful gobblin run of job event_default was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [15:33:48] (GobblinLastSuccessfulRunTooLongAgo) firing: (2) Last successful gobblin run of job eventlogging_legacy was more than 2 hours ago. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [15:34:42] 10Quarry: [bug] "Internal Server Error" when logging into Quarry - https://phabricator.wikimedia.org/T333043 (10rook) ` Mar 28 15:33:47 quarry-web-02 uwsgi-quarry-web[17729]: [pid: 17737|app: 0|req: 44/125] 172.16.5.238 () {52 vars in 851 bytes} [Tue Mar 28 15:33:46 2023] GET /login?next=/ => generated 567 bytes... [15:44:58] 10Data-Engineering: Add analytics-platform-eng-admins on stat* hosts - https://phabricator.wikimedia.org/T333264 (10Htriedman) @Ottomata Any of these options would work for me: # enabling `analytics-platform-eng` on stat machines # enabling data publication to `/srv/published` from airflow machines # enabling d... [15:45:50] !log proceeding with a refinery deploy [15:46:12] 10Data-Engineering, 10Data-Services: Wiki replicas are not fully setup for newly created wikis - https://phabricator.wikimedia.org/T315442 (10MarcoAurelio) Hello @BTullis - Any updates? Some tools are not working for us due to those wikis not being in `meta_p`. Regards. [15:46:55] 10Data-Engineering: Add analytics-platform-eng-admins on stat* hosts - https://phabricator.wikimedia.org/T333264 (10Ottomata) > trying to publish DP data from hive for the last few weeks Right, so can you explain what you are trying to do? That will help us figure out the best approach. Feel free to repurpos... [15:48:48] (GobblinLastSuccessfulRunTooLongAgo) resolved: Last successful gobblin run of job webrequest was more than 2 hours ago. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://grafana.wikimedia.org/d/pAQaJwEnk/gobblin?orgId=1&var-gobblin_job_name=webrequest - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [15:51:06] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ayounsi) [15:51:20] 10Data-Engineering, 10Data-Services: Wiki replicas are not fully setup for newly created wikis - https://phabricator.wikimedia.org/T315442 (10BTullis) Hi @MarcoAurelio - Apologies for the delay. I'll update these asap. [15:53:31] (SystemdUnitFailed) firing: (8) refine_netflow.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:54:55] 10Data-Engineering, 10Data-Persistence, 10Discovery-Search, 10Infrastructure-Foundations, and 6 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ayounsi) [15:55:04] 10Data-Engineering: Add analytics-platform-eng-admins on stat* hosts - https://phabricator.wikimedia.org/T333264 (10Htriedman) 05Open→03Resolved a:03Htriedman Resolving this ticket and add my usecase to T317167 [15:55:29] 10Data-Engineering, 10Data-Persistence, 10Discovery-Search, 10Infrastructure-Foundations, and 6 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ayounsi) [15:58:33] !log deploying refinery to HDFS [15:58:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:03:18] I'm going to wait before deploying airflow, to make sure I don't get the branch wrong or anything. [16:03:35] ack btullis - we're in standup [16:03:55] Sorry, I have a clash with the k8s-sig meeting. Can't make standup. 
[16:05:10] np btullis [16:19:39] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ayounsi) [16:34:51] 10Quarry: [bug] "Internal Server Error" when logging into Quarry - https://phabricator.wikimedia.org/T333043 (10rook) This may be a cookie issue. When logging in from a private window I cannot recreate on the first login, each time I've tried I've been sent to the wikimedia login page. After logging in, I can re... [16:36:35] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10MatthewVernon) [16:36:57] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10MatthewVernon) [17:13:45] 10Data-Engineering, 10Advanced-Search, 10All-and-every-Wikisource, 10ArticlePlaceholder, and 68 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (10Jdlrobson) [17:18:31] (SystemdUnitFailed) firing: (8) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:20:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:21:49] mforns: wanna talk about druid loading a bit more? [17:21:56] yes joal :] [17:22:02] cave? [17:22:03] batcave! 
[17:26:32] Data-Engineering-Planning, Data Pipelines: Support for moving data from HDFS to public http file server - https://phabricator.wikimedia.org/T317167 (Ottomata) Cool! ^ almost sounds like something it would be nice to have in AQS! Not sure if there is desire/bandwidth to do that though. @Milimetric ? [17:30:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:33:31] (SystemdUnitFailed) firing: (8) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:37:59] Data-Engineering: Add analytics-platform-eng-admins on stat* hosts - https://phabricator.wikimedia.org/T333264 (Dzahn) Resolved→Declined [18:12:09] Data-Engineering-Planning, Data Pipelines: Support for moving data from HDFS to public http file server - https://phabricator.wikimedia.org/T317167 (Htriedman) @Ottomata: @Milimetric and I have talked about adding this data to AQS at some point in the short-/mid-term future, but I think we're going to wa... [19:16:44] Data-Engineering-Planning, Data Pipelines: Support for moving data from HDFS to public http file server - https://phabricator.wikimedia.org/T317167 (Ottomata) Okay, yeah, the best thing to do would be as @JAllemandou suggests I think then. It is a non airflow node & user specific solution. @lbowmaker w...
[19:49:47] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (Eevans) [19:51:37] Data-Engineering, Data-Persistence, Discovery-Search, Infrastructure-Foundations, and 7 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (Eevans) [20:21:32] (CR) Krinkle: Add first input delay schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/902693 (https://phabricator.wikimedia.org/T332012) (owner: Lgaulia) [20:40:11] Data-Engineering, Observability-Alerting, Patch-For-Review: Migrate eventlogging check_prometheus checks to alertmanager - https://phabricator.wikimedia.org/T309007 (cmooney) Open→Resolved [20:53:33] Data-Engineering-Planning, DBA, Data Pipelines, Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (colewhite) [20:56:16] Data-Engineering, Data-Persistence, Discovery-Search, Infrastructure-Foundations, and 7 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (colewhite) [21:30:31] Analytics, AQS 2.0 Roadmap, API Platform (API Platform Roadmap), Documentation, and 3 others: AQS 2.0 documentation - https://phabricator.wikimedia.org/T288664 (apaskulin) [21:33:31] (SystemdUnitFailed) firing: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed