[01:16:58] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 0.3368% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [01:48:27] (SystemdUnitFailed) firing: (14) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:06:05] (KafkaReplicationFactorTooLow) firing: ... [02:06:05] Kafka topic codfw.cpjobqueue.retry.mediawiki.job.ImageSuggestionsNotifications replication factor is too low on main-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?var-kafka_cluster=main-eqiad&var-kafka_broker=All&var-topic=codfw.cpjobqueue.retry.mediawiki.job.ImageSuggestionsNotifications&viewPanel=40 - ... [02:06:05] https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow [02:11:05] (KafkaReplicationFactorTooLow) resolved: ... [02:11:05] Kafka topic codfw.cpjobqueue.retry.mediawiki.job.ImageSuggestionsNotifications replication factor is too low on main-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?var-kafka_cluster=main-eqiad&var-kafka_broker=All&var-topic=codfw.cpjobqueue.retry.mediawiki.job.ImageSuggestionsNotifications&viewPanel=40 - ... 
[02:11:05] https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow [04:28:26] (SystemdUnitFailed) firing: (15) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:16:58] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 0.3266% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [07:19:20] (03PS3) 10Snwachukwu: [WIP] Add Reportupdater Browser All Sites Queries. [analytics/refinery] - 10https://gerrit.wikimedia.org/r/995740 (https://phabricator.wikimedia.org/T354552) [08:08:27] (SystemdUnitFailed) firing: (16) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:16:59] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 0.2512% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [09:40:02] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [10:01:50] 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Problem downloading large files from analytics.wikimedia.org - https://phabricator.wikimedia.org/T356792 (10BTullis) Yes, the timeout happens reliably at around 65 seconds. 
Thanks @Vgutierrez - I'll remove the Traffic tag and focus on envoy. [10:10:22] 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Problem downloading large files from analytics.wikimedia.org - https://phabricator.wikimedia.org/T356792 (10BTullis) a:03BTullis [10:38:27] (SystemdUnitFailed) firing: (17) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:14:20] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:30:21] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Problem downloading large files from analytics.wikimedia.org - https://phabricator.wikimedia.org/T356792 (10BTullis) Thanks again @Vgutierrez - That makes perfect sense now. As per: https://www.envoyproxy.io/docs/envoy/latest/faq/configuration/... [11:58:04] (03PS3) 10Gmodena: WIP - Add webrequest schema [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/983898 (https://phabricator.wikimedia.org/T314956) (owner: 10Ottomata) [11:58:46] (03CR) 10CI reject: [V: 04-1] WIP - Add webrequest schema [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/983898 (https://phabricator.wikimedia.org/T314956) (owner: 10Ottomata) [12:02:38] 10Data-Engineering, 10Data-Persistence-Backup, 10Infrastructure-Foundations, 10bacula, and 2 others: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655 (10jcrespo) This is looking good- I will check in th... 
[12:06:36] (03PS4) 10Gmodena: WIP - Add webrequest schema [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/983898 (https://phabricator.wikimedia.org/T314956) (owner: 10Ottomata) [12:24:55] !log restart jvm services on an-master1004 for T353776 and to pick up new JDK [12:24:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:24:58] T353776: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 [12:36:34] !log failover hadoop namenode to an-master1004 for jvm service restart to pick up new JDK and T353776 [12:36:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:36:37] T353776: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 [12:48:30] !log restart jvm services on an-master1003 for T353776 and to pick up new JDK [12:48:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:48:33] T353776: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 [12:55:20] 10Data-Engineering, 10Data-Persistence-Backup, 10Infrastructure-Foundations, 10bacula, 10netbox: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655 (10Volans) 05Open→03Resolved @BTullis thanks for t... 
[13:01:37] !log failover hadoop namenode back to an-master1003 after the jvm service restart to pick up new JDK and T353776 [13:01:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:01:40] T353776: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 [13:07:30] PROBLEM - Check systemd state on an-master1003 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-namenode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:32] PROBLEM - Hadoop Namenode - Primary on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process [13:08:27] (SystemdUnitFailed) firing: (17) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:10:26] RECOVERY - Check systemd state on an-master1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:10:28] RECOVERY - Hadoop Namenode - Primary on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process [13:11:40] RECOVERY - HDFS topology check on an-master1003 is OK: OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_topology_check [13:13:27] (SystemdUnitFailed) firing: (17) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:16:59] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 0.1815% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [13:23:27] (SystemdUnitFailed) firing: (17) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:25:10] PROBLEM - Check systemd state on an-master1003 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-namenode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:25:10] PROBLEM - Hadoop Namenode - Primary on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process [13:29:01] 10Data-Engineering (Sprint 8), 10Patch-For-Review: [Iceberg Migration] Migrate session length tables to Iceberg - https://phabricator.wikimedia.org/T352672 (10CodeReviewBot) joal opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/608 Update analytics session_length DAG [13:29:19] (03PS43) 10Cyndywikime: Add analytics for Impressions, Success and Abandonment of account creation [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) [13:34:44] aqu: Hi! 
I need a review from you please, on patch just above --^ [13:39:27] RECOVERY - Hadoop Namenode - Primary on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process [13:39:59] RECOVERY - Check systemd state on an-master1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:43:27] (SystemdUnitFailed) firing: (17) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:47:15] (HdfsFSImageAge) firing: The HDFS FSImage on analytics-hadoop:an-master1004:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [13:48:44] 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (10Stevemunene) The hosts are slowly balancing in the cluster and should help with the low capacity warnings we were getting. {F41803475} Namenodes services have also been restarted... [13:52:15] (HdfsFSImageAge) firing: (2) The HDFS FSImage on analytics-hadoop:an-master1003:10080 is too old. 
- https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [14:04:27] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: Allow federated queries with the NFDI4Culture Knowledge Graph - https://phabricator.wikimedia.org/T346455 (10Loz.ross) Hi @RKemper, Thanks so much for working on this ticket! We tried a query with my colleagues, but we're gett... [14:13:27] 10Data-Engineering: Update data_quality schemas to be compatible with Iceberg tables - https://phabricator.wikimedia.org/T356866 (10JAllemandou) [14:17:15] (HdfsFSImageAge) firing: The HDFS FSImage on analytics-hadoop:an-master1004:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [14:22:15] (HdfsFSImageAge) firing: (2) The HDFS FSImage on analytics-hadoop:an-master1003:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [14:28:50] 10Data-Engineering, 10Data Products, 10Observability-Logging, 10Traffic, 10Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (10Fabfur) Some updates about the ongoing work: Currently our Benthos configuration produces this output, when fed with HAPr... 
[14:36:31] 10Data-Engineering (Sprint 8), 10Patch-For-Review: Add `event.app_donor_experience` fields to event sanitization allowlist - https://phabricator.wikimedia.org/T356214 (10mpopov) 05Open→03Resolved a:03mpopov Thanks for merging & deploying, @JAllemandou! @SNowick_WMF: I just confirmed that `event_sanitiz... [14:47:15] (HdfsFSImageAge) resolved: The HDFS FSImage on analytics-hadoop:an-master1003:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [14:47:15] (HdfsFSImageAge) firing: (2) The HDFS FSImage on analytics-hadoop:an-master1003:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [14:52:15] (HdfsFSImageAge) resolved: (2) The HDFS FSImage on analytics-hadoop:an-master1003:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [15:01:23] 10Data-Engineering, 10Data-Platform-SRE: [superset-k8s] Find a solution for the requestctl-generator html page - https://phabricator.wikimedia.org/T356490 (10brouberol) I investigated how we could solve this, and we have 2 different ways I think we could re-implement the current behavior in k8s. First, we cou... 
[15:22:29] (03CR) 10Gmodena: data-quality: rename source table column (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/997485 (https://phabricator.wikimedia.org/T356628) (owner: 10Gmodena) [15:32:28] 10Data-Engineering: [Data Quality] Update data_quality schemas to be compatible with Iceberg tables - https://phabricator.wikimedia.org/T356866 (10Ahoelzl) [15:33:36] 10Data-Engineering (Sprint 8): [Data Quality] Update data_quality schemas to be compatible with Iceberg tables - https://phabricator.wikimedia.org/T356866 (10Ahoelzl) [15:35:30] !log rolling out a change of the discovery-uri to presto workers and clients https://gerrit.wikimedia.org/r/c/operations/puppet/+/998425 [15:35:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:41:05] (KafkaReplicationFactorTooLow) firing: ... [15:41:05] Kafka topic codfw.cpjobqueue.retry.mediawiki.job.RebuildMessageGroupStatsJob replication factor is too low on main-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?var-kafka_cluster=main-eqiad&var-kafka_broker=All&var-topic=codfw.cpjobqueue.retry.mediawiki.job.RebuildMessageGroupStatsJob&viewPanel=40 - ... [15:41:05] https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow [15:46:03] is there a strong reason for this ^ topic ^ to not be replicated? [15:46:05] (KafkaReplicationFactorTooLow) resolved: ... [15:46:05] Kafka topic codfw.cpjobqueue.retry.mediawiki.job.RebuildMessageGroupStatsJob replication factor is too low on main-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?var-kafka_cluster=main-eqiad&var-kafka_broker=All&var-topic=codfw.cpjobqueue.retry.mediawiki.job.RebuildMessageGroupStatsJob&viewPanel=40 - ... 
[15:46:05] https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow [15:47:29] I guess not hen [15:47:31] brouberol: I wouldn't have thought so, but then I'm not really sure. It's on the main cluster and it's something to do with the MediaWiki job queues, so it's probably a question for ServiceOps. [15:47:31] *then [15:47:59] Did it fix itself, or was it a monitoring anomalomaly? [15:48:19] I think that the topic was new and that it was a case of our monitoring not tolerating null series [15:48:46] as the topic has a replication factor of 3 [15:54:57] is it possible the alert is misnamed? perhaps it is alerting on under-replicated partitions? [15:58:09] this monitor triggers when we create a topic with a replication factor = 1, it's not looking at the ISR size IIRC [15:58:32] weird okay [15:59:11] I think it's a case of a topic being new and the data not "filling" the evaluation window [15:59:25] I should probably use max instead of avg. I'll flag that for later [15:59:33] hnowlan: _joe_ ^^ [15:59:43] possibly related to current jobqueue incident? [16:00:15] context: we've been seeing jobqueue errors since 14:55ish https://logstash.wikimedia.org/goto/684a454f5135b7b7fdb695a19b0ec98d [16:00:17] <_joe_> the incident is with eventgate-main [16:00:29] <_joe_> not the jobqueue, that is a consequence [16:01:09] <_joe_> uhm that topic might be related to the script lucas launched? [16:01:34] <_joe_> no the timing isn't right [16:01:59] I looked in kafka, and the topic is fully replicated. I'd tend to file that under "monitoring flukes" except if there's something actively broken with it specifically [16:03:01] <_joe_> brouberol: yeah, no, I think eventgate-main isn't that healthy though [16:03:21] <_joe_> the service mesh is reporting connection errors since 14:55 [16:05:06] anything we can help with? [16:06:57] <_joe_> brouberol: looking into eventgate-main? 
but right now monitoring seems dead, so I'd wait [16:08:32] ping aqu on reviews please [16:11:34] <_joe_> brouberol: nevermind, the issues went away on our side a few minutes ago [16:18:59] ack, 👍 [17:16:59] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 0.1132% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [17:18:27] (SystemdUnitFailed) firing: (18) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:26:34] !log roll-restarting kafka-jumbo for T356382 [17:26:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:28:06] (KafkaReplicationFactorTooLow) firing: (20) Kafka topic codfw.android.breadcrumbs_event replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow [17:31:08] Ping gmodena if you can: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/998482 [17:31:16] otherwise I'll do it myself :) [17:33:06] (KafkaReplicationFactorTooLow) resolved: (20) Kafka topic codfw.android.breadcrumbs_event replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow [17:39:35] as regards earlier - it looks like there were issues connecting to both eventgate-main and eventgate-analytics during the window of errors https://grafana.wikimedia.org/goto/2ap2i72Iz?orgId=1 [17:44:05] (KafkaReplicationFactorTooLow) 
firing: (55) Kafka topic DataHubUpgradeHistory_v1 replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow [17:49:07] (KafkaReplicationFactorTooLow) resolved: (55) Kafka topic DataHubUpgradeHistory_v1 replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow [17:50:05] (KafkaReplicationFactorTooLow) firing: (17) Kafka topic android replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow [17:55:07] (KafkaReplicationFactorTooLow) resolved: (72) Kafka topic DataHubUpgradeHistory_v1 replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow [17:56:53] (KafkaReplicationFactorTooLow) firing: (52) Kafka topic DataHubUpgradeHistory_v1 replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow [18:01:39] (KafkaReplicationFactorTooLow) resolved: (87) Kafka topic DataHubUpgradeHistory_v1 replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow [18:10:51] Starting build #137 for job analytics-refinery-maven-release-docker [18:12:53] (KafkaReplicationFactorTooLow) firing: (166) Kafka topic 
codfw.android.product_metrics.article_toolbar_interaction replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow [18:13:35] (KafkaReplicationFactorTooLow) resolved: (15) Kafka topic codfw.ios.setting_action replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow [18:17:57] (KafkaReplicationFactorTooLow) firing: (229) Kafka topic SpecialInvestigate replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow [18:18:47] (KafkaReplicationFactorTooLow) resolved: (229) Kafka topic SpecialInvestigate replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow [18:25:42] 10Data-Engineering (Sprint 9): [Data Quality] Update data_quality schemas to be compatible with Iceberg tables - https://phabricator.wikimedia.org/T356866 (10Ahoelzl) [18:27:25] Project analytics-refinery-maven-release-docker build #137: 09SUCCESS in 16 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/137/ [18:35:04] 10Data-Platform-SRE, 10SRE-Access-Requests: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10Dzahn) Should we close this ticket as "invalid"? It seems the best course of action might be a new ticket like "migrate all WMDE pipelines to airflow" an... 
[18:35:47] 10Data-Platform-SRE, 10SRE-Access-Requests: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10Dzahn) 05Open→03In progress [18:49:07] 10Data-Platform-SRE, 10SRE-Access-Requests: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10mpopov) Not yet. I believe @AndrewTavis_WMDE will be sharing some findings from WMDE side soon. [18:55:29] Starting build #98 for job analytics-refinery-update-jars-docker [18:55:55] (03PS1) 10Maven-release-user: Add refinery-source jars for v0.2.32 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/998275 [18:55:56] Project analytics-refinery-update-jars-docker build #98: 09SUCCESS in 26 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/98/ [19:08:28] (SystemdUnitFailed) firing: (19) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:16:51] 10Data-Platform-SRE, 10SRE, 10SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10mpopov) Tagging DPE SRE in case this is specific to those tools. @cchen: Can you please verify if you can `ssh` to the stat hosts and also use Jupyte... [19:17:43] 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): [Iceberg Migration] P.O.C. on Iceberg sensor using Postgres table to keep status of updates - https://phabricator.wikimedia.org/T340466 (10BTullis) @mforns - Would you like us to go ahead and create this database (or these databases) for you? H... [19:17:50] 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): [Iceberg Migration] P.O.C. 
on Iceberg sensor using Postgres table to keep status of updates - https://phabricator.wikimedia.org/T340466 (10BTullis) a:03BTullis [19:18:51] 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Bring stat1011 into service - https://phabricator.wikimedia.org/T354526 (10BTullis) Puppet is now running cleanly on stat1011, so I think that it is just a case of announcing it to the users and adding the docs to Wikitech. [19:22:14] 10Data-Platform-SRE, 10SRE, 10SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10cchen) I `ssh` the stats machine and `kinit`, and got `Password incorrect while getting initial credentials. and I also tried JupyterHub, and it also... [19:23:28] (SystemdUnitFailed) firing: (20) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:24:39] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/998275 (owner: 10Maven-release-user) [19:26:02] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/997960 (https://phabricator.wikimedia.org/T347879) (owner: 10Joal) [19:28:28] (SystemdUnitFailed) firing: (20) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:28:52] 10Data-Platform-SRE, 10SRE, 10SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10Dzahn) @cchen When you ran kinit the first time after you logged in, did it ask you to change 
the password? Did you get a new temporary one by mail?... [19:36:02] 10Data-Platform-SRE, 10SRE, 10SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10cchen) @Dzahn Oh, I see. I found the email and reran the kinit with the temporary password, it works now. [19:38:33] 10Data-Engineering (Sprint 8), 10Patch-For-Review: [BUG] webrequest analyzer DQ jobs fails to store data - https://phabricator.wikimedia.org/T356401 (10CodeReviewBot) gmodena opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/609 analytics: webrequest: version bump refine... [19:43:28] (SystemdUnitFailed) firing: (20) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:49:28] !log Release refinery-source v0.2.32 [19:49:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:49:38] !log Deployed refinery using scap [19:49:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:49:49] !log deploying Refinery onto HDFS [19:49:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:56:26] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bking@cumin2002 for hosts: `cloudelastic1008.wikimedia.org` - cloudelastic1008.wikim... 
[19:58:14] 10Data-Engineering (Sprint 8), 10Patch-For-Review: [Iceberg Migration] Migrate session length tables to Iceberg - https://phabricator.wikimedia.org/T352672 (10CodeReviewBot) joal merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/608 Update analytics session_length DAG [20:01:23] 10Data-Engineering (Sprint 8), 10Patch-For-Review: [BUG] webrequest analyzer DQ jobs fails to store data - https://phabricator.wikimedia.org/T356401 (10CodeReviewBot) joal merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/609 analytics: webrequest_analyzer: version bump... [20:08:28] (SystemdUnitFailed) firing: (19) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:09:07] !log Relaunch druid_load_unique_devices_per_domain_daily_aggregated_monthly after deploy [20:09:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:09:50] hnowlan: _joe_ eventgate-main and eventgate-analytics at the same time? [20:10:12] that's strange [20:10:19] 10Data-Platform-SRE, 10SRE, 10SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10Dzahn) 05In progress→03Resolved Great! Feel free to reopen the ticket if there is anything else missing. 
[20:17:38] Relaunch session_length_daily failed task [20:17:42] !log Relaunch session_length_daily failed task [20:17:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:17:48] Data-Engineering, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 10 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (Sbailey) [20:25:38] Data-Engineering, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 10 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (Sbailey) Chromium render (Proton) upgraded from Node 12 to Node 18 [content transform] update to 2024-02-05-181957-pr... [20:28:29] (SystemdUnitFailed) firing: (20) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:58:28] (SystemdUnitFailed) firing: (20) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:08:29] (SystemdUnitFailed) firing: (20) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:16:59] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 1.572e-05% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - 
https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [21:18:28] (SystemdUnitFailed) firing: (20) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:28:14] Data-Platform-SRE, SRE, SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (mpopov) @cchen: How about Superset & Hue? [21:28:28] (SystemdUnitFailed) firing: (20) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:31:10] Data-Platform-SRE (2024.01.22 - 2024.02.11), DC-Ops, ops-eqiad: Comm Error: Backplane 0 on cloudelastic1008 - https://phabricator.wikimedia.org/T356919 (bking) [21:42:39] Data-Platform-SRE, SRE, SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (cchen) I still not able to access Superset & Hue, and i tried to reset my password again, still not working. [21:49:08] Data-Platform-SRE, SRE, SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (RhinosF1) You look still to be blocked on wikitech https://wikitech.wikimedia.org/wiki/Special:Contributions/Conniecc1 - not sure if that's related bu... 
[21:49:12] Data-Platform-SRE, SRE, SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (Dzahn) Resolved→Open [21:49:22] Data-Platform-SRE, SRE, SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (Dzahn) a:cchen→None [21:50:53] Data-Platform-SRE, SRE, SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (RhinosF1) [21:52:12] Data-Platform-SRE, SRE, SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (RhinosF1) I've added a checklist based on the private task. @MoritzMuehlenhoff (or another SRE): please update based on what already works @cchen: i... [23:04:53] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [23:08:28] (SystemdUnitFailed) firing: (19) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:23:29] (SystemdUnitFailed) firing: (20) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:59:31] Data-Platform-SRE (2024.01.22 - 2024.02.11): Stale data/failed 
queries on wikidatawiki index - https://phabricator.wikimedia.org/T356941 (bking)