[06:29:48] 10Data-Engineering-Planning, 10Data-Catalog: Establish a Business Glossary - https://phabricator.wikimedia.org/T311524 (10Aklapper) a:05EChetty→03None Removing inactive assignee (please do so as part of team offboarding!). [06:29:54] 10Data-Engineering-Planning, 10Data-Catalog: Document Two Additional Canonical Datasets - https://phabricator.wikimedia.org/T308048 (10Aklapper) a:05EChetty→03None Removing inactive assignee (please do so as part of team offboarding!). [06:29:58] 10Data-Engineering: [Anomaly detection] Create a heatmap view in Superset - https://phabricator.wikimedia.org/T301572 (10Aklapper) a:05EChetty→03None Removing inactive assignee (please do so as part of team offboarding!). [06:30:09] 10Data-Engineering, 10Anti-Harassment, 10Metrics-Platform-Planning, 10Privacy Engineering, and 3 others: Measure user-agent client hints already sent in browsers requests - https://phabricator.wikimedia.org/T299397 (10Aklapper) a:05EChetty→03None Removing inactive assignee (please do so as part of team... [06:30:33] 10Data-Engineering-Planning, 10Data-Catalog: Adding Datasets: MediaWiki History - https://phabricator.wikimedia.org/T307701 (10Aklapper) a:05EChetty→03None Removing inactive assignee (please do so as part of team offboarding!). [06:30:37] 10Data-Engineering: Data structuring guidance request - https://phabricator.wikimedia.org/T287402 (10Aklapper) a:05EChetty→03None Removing inactive assignee (please do so as part of team offboarding!). 
[07:46:23] 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work): Migrate rdf_streaming_updater_reconcile.py from airflow 1 to airflow 2 - https://phabricator.wikimedia.org/T329879 (10pfischer) a:03pfischer [08:19:12] 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10fgiunchedi) [08:26:55] 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work): Migrate rdf_streaming_updater_reconcile.py from airflow 1 to airflow 2 - https://phabricator.wikimedia.org/T329879 (10pfischer) a:05pfischer→03dcausse [08:31:26] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Investigate trend of gradual hive server heap exhaustion - https://phabricator.wikimedia.org/T303168 (10nfraison) [08:31:29] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Investigate slownesses on an-worker1132 - https://phabricator.wikimedia.org/T330979 (10nfraison) [08:38:39] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Investigate DB connection issues faced from airflow on an-aluncher1002 - https://phabricator.wikimedia.org/T331265 (10nfraison) [08:42:05] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Investigate DB connection issues faced from airflow on an-aluncher1002 - https://phabricator.wikimedia.org/T331265 (10nfraison) [08:42:19] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Investigate DB connection issues faced from airflow on an-aluncher1002 - https://phabricator.wikimedia.org/T331265 (10nfraison) Looks to me that we reach 100% network usage on an-launcher1002 when the connection issues hap... 
[08:45:51] 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10MoritzMuehlenhoff) [09:05:36] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Investigate DB connection issues faced from airflow on an-aluncher1002 - https://phabricator.wikimedia.org/T331265 (10nfraison) Trying to identify process which could generate this tx network usage: - running a PS command... [09:09:10] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Investigate DB connection issues faced from airflow on an-aluncher1002 - https://phabricator.wikimedia.org/T331265 (10nfraison) a:03nfraison [09:23:41] !log Reimage an-conf1001 to upgrade to bullseye T329362 [09:23:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:23:46] T329362: Upgrade the Data Engineering team's Zookeeper servers to Bullseye - https://phabricator.wikimedia.org/T329362 [09:36:47] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Upgrade the Data Engineering team's Zookeeper servers to Bullseye - https://phabricator.wikimedia.org/T329362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by nfraison@cumin1001 for host an-conf10... [09:49:16] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Upgrade the Data Engineering team's Zookeeper servers to Bullseye - https://phabricator.wikimedia.org/T329362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by nfraison@cumin1001 for host an-conf1001.e... 
[09:49:43] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Upgrade the Data Engineering team's Zookeeper servers to Bullseye - https://phabricator.wikimedia.org/T329362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by nfraison@cumin1001 for host an-conf10... [10:12:28] RECOVERY - Check systemd state on an-airflow1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:14:13] joal: otto: nfraison: steve_munene: Here is a patch to disable gobblin ingestion prior to tomorrow's switch maintenance: https://gerrit.wikimedia.org/r/c/operations/puppet/+/894537 [10:14:28] ottomata: ^ [10:15:09] I'm wondering if there is a cleaner way that doesn't delete historical logs. I could just disable puppet and disable the timers manually, I suppose? [10:15:27] !log deploy mediawiki_history_reduced_2023_02 snapshot to AQS [10:15:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:17:37] interesting btullis how long will they be disabled? [10:17:54] btullis: if it is for switch maintenance, you'll probably also want to disable the test cluster ones too [10:19:28] Well, the maintenance window on the switch is 2 hours, but with luck the switch upgrade will only take 30 minutes. We want to put HDFS into safe mode prior to the start of the maintenance window. 
[10:20:31] So, I'd say probably 90 minutes to allow for: disabling timers => allowing currently running timers to finish => entering safe mode => switch upgrade => exiting safe mode [10:21:44] PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:23:27] ottomata: Yes, you're probably right. We will lose access to an-test-master1001 but we can fail over. Then we will lose 1/3 of the test cluster workers (an-test-worker1001) so yes we want to put these into safe mode too. [10:24:01] btullis: i think absenting the timers is fine. disabling via stopping puppet is fine too. [10:24:13] i don't mind losing the timer (journalctl? ) history [10:24:20] the gobblin job status history is in HDFS [10:24:22] I'm planning to fail over the prod master to an-master1002 today, during a quiet period for the cluster. [10:24:33] and re-ensuring the timers after the window should just work fine [10:24:43] when we reenable [10:24:52] we should keep an eye on webrequest lag [10:24:54] it should catch up [10:24:57] but it might take a while... :) [10:25:02] Ah, great. I hadn't thought of the gobblin history being in HDFS. I'll update the patch with the test cluster too. 
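Editor's sketch of the sequence btullis outlines at 10:20:31 (disable timers, let running jobs finish, enter safe mode, switch upgrade, exit safe mode), written as a dry-run plan that only builds a list of shell commands and runs nothing. It follows the manual alternative mentioned at 10:15:09 rather than the puppet patch. The timer unit name is hypothetical and the function `maintenance_plan` is an illustration, not a script from the team's tooling; `systemctl stop/start` and `hdfs dfsadmin -safemode enter/leave` are the standard commands assumed here.

```python
from typing import List


def maintenance_plan(timer_units: List[str]) -> List[str]:
    """Return the ordered steps for the maintenance window as strings.

    Nothing is executed: lines starting with '#' are manual steps,
    the rest are shell commands an operator would run by hand.
    """
    steps: List[str] = []

    # 1. Stop the ingestion timers so that no new runs are started.
    for unit in timer_units:
        steps.append(f"systemctl stop {unit}")

    # 2. Jobs that are already running must be allowed to finish.
    steps.append("# wait for currently running jobs to finish")

    # 3. Put HDFS into safe mode before the window opens.
    steps.append("hdfs dfsadmin -safemode enter")

    # 4. The switch upgrade itself is outside the team's control.
    steps.append("# switch upgrade happens here")

    # 5. Leave safe mode, then bring the timers back.
    steps.append("hdfs dfsadmin -safemode leave")
    for unit in timer_units:
        steps.append(f"systemctl start {unit}")

    return steps


if __name__ == "__main__":
    # 'gobblin-example.timer' is a made-up unit name for illustration.
    for step in maintenance_plan(["gobblin-example.timer"]):
        print(step)
```

Keeping the plan as data makes the ordering easy to review before the window; the same list would need one more pass for the test cluster, as ottomata points out.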
[10:27:02] (03CR) 10Kosta Harlan: [C: 03+1] Remove SpecialMuteSubmit allowlist entry [analytics/refinery] - 10https://gerrit.wikimedia.org/r/893998 (https://phabricator.wikimedia.org/T329718) (owner: 10Phuedx) [10:29:18] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Upgrade the Data Engineering team's Zookeeper servers to Bullseye - https://phabricator.wikimedia.org/T329362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by nfraison@cumin1001 for host an-conf1001.e... [10:32:28] RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:32:43] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Upgrade the Data Engineering team's Zookeeper servers to Bullseye - https://phabricator.wikimedia.org/T329362 (10nfraison) an-conf1001 reimaged but zookeeper not starting This was due to /etc/zookeeper/conf/version-2/ not... 
[11:10:44] 10Analytics, 10Data-Engineering, 10Event-Platform Value Stream, 10Infrastructure-Foundations: > ~1 request/second to intake-logging.wikimedia.org times out at the traffic/service interface - https://phabricator.wikimedia.org/T264021 (10MatthewVernon) [11:43:12] 10Data-Engineering, 10Event-Platform Value Stream: Upgrade flink-kubernetes-operator to 1.4.0 - https://phabricator.wikimedia.org/T331282 (10Ottomata) [11:44:12] 10Data-Engineering, 10Event-Platform Value Stream: Upgrade flink-kubernetes-operator to 1.4.0 - https://phabricator.wikimedia.org/T331282 (10Ottomata) [11:46:08] 10Data-Engineering, 10Event-Platform Value Stream: Store Flink HA state in Zookeeper for mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T331283 (10Ottomata) [11:47:58] 10Data-Engineering, 10Event-Platform Value Stream: Upgrade flink-kubernetes-operator to 1.4.0 - https://phabricator.wikimedia.org/T331282 (10Ottomata) [11:48:23] 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Epic: Deploy mediawiki-page-content-change-enrichment to wikikube k8s - https://phabricator.wikimedia.org/T325303 (10Ottomata) [12:26:16] !log Reimage an-conf1002 to upgrade to bullseye T329362 [12:26:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:26:20] T329362: Upgrade the Data Engineering team's Zookeeper servers to Bullseye - https://phabricator.wikimedia.org/T329362 [12:31:03] 10Data-Engineering, 10Event-Platform Value Stream: Store Flink HA metadata in Zookeeper for mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T331283 (10Ottomata) [12:31:08] 10Data-Engineering, 10Event-Platform Value Stream: Upgrade flink-kubernetes-operator to 1.4.0 - https://phabricator.wikimedia.org/T331282 (10Ottomata) [12:31:16] PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, 
WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:33:03] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Upgrade the Data Engineering team's Zookeeper servers to Bullseye - https://phabricator.wikimedia.org/T329362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by nfraison@cumin1001 for host an-conf10... [12:42:04] RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:57:49] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 09): mediawiki-event-enrichment should support the latest eventutilities-python changes - https://phabricator.wikimedia.org/T330994 (10gmodena) [13:08:23] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Upgrade the Data Engineering team's Zookeeper servers to Bullseye - https://phabricator.wikimedia.org/T329362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by nfraison@cumin1001 for host an-conf1002.e... 
[13:08:34] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Upgrade the Data Engineering team's Zookeeper servers to Bullseye - https://phabricator.wikimedia.org/T329362 (10nfraison) an-conf1002 done [13:17:57] !log failing over analytics test cluster namenode service to an-test-master1002 T329073 [13:18:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:18:01] T329073: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 [13:24:15] 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10BTullis) [13:31:15] (03PS1) 10DCausse: ProduceCanaryEvents: set a timeout on the http client [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/894642 (https://phabricator.wikimedia.org/T330236) [13:37:30] (03PS2) 10DCausse: ProduceCanaryEvents: set a timeout on the http client [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/894642 (https://phabricator.wikimedia.org/T330236) [13:50:42] 10Data-Engineering-Planning, 10Patch-For-Review, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Finalize the multi-dc configuration of AQS (nodejs) in codfw - https://phabricator.wikimedia.org/T331115 (10JArguello-WMF) [13:53:23] !log failing over the production hadoop cluster namenode service to an-master1002 [13:53:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:53:55] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 09): [Flink Operation] How to handle app upgrades - https://phabricator.wikimedia.org/T328569 (10lbowmaker) [13:54:24] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 09): [Flink Operations] How to handle restarting a Flink application - https://phabricator.wikimedia.org/T328563 (10lbowmaker) [13:55:36] Hadoop namenode failover successful [13:55:40] 
https://www.irccloud.com/pastebin/eTbbdJ33/ [14:01:50] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [14:07:15] 10Data-Engineering-Planning, 10Patch-For-Review, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Finalize the multi-dc configuration of AQS (nodejs) in codfw - https://phabricator.wikimedia.org/T331115 (10BTullis) I have had several discussions about this now and the consensus seems to be that:... [14:08:47] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 09), 10Patch-For-Review: Upgrade flink-kubernetes-operator to 1.4.0 - https://phabricator.wikimedia.org/T331282 (10Ottomata) a:03Ottomata [14:08:51] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 09), 10Patch-For-Review: Upgrade flink-kubernetes-operator to 1.4.0 - https://phabricator.wikimedia.org/T331282 (10Ottomata) [14:10:58] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 09), 10Patch-For-Review: Streaming services errors should be routed to an error event topic. 
- https://phabricator.wikimedia.org/T326536 (10lbowmaker) [14:13:15] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 09): Document Flink job deployment to k8s - https://phabricator.wikimedia.org/T329629 (10lbowmaker) [14:14:16] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 09): Flink + Event Platform integration for writing into streams via Table API - https://phabricator.wikimedia.org/T324114 (10lbowmaker) [14:23:07] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Investigate DB connection issues faced from airflow on an-launcher1002 - https://phabricator.wikimedia.org/T331265 (10nfraison) [14:34:35] 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10BTullis) [14:42:12] 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10MatthewVernon) [14:48:45] (03PS1) 10Aqu: Migrate refine webrequest to Airflow [analytics/refinery] - 10https://gerrit.wikimedia.org/r/894661 (https://phabricator.wikimedia.org/T327073) [15:11:54] PROBLEM - Hadoop HDFS Namenode FSImage Age on an-master1002 is CRITICAL: FILE_AGE CRITICAL: /srv/hadoop/name/current/VERSION is 7220 seconds old and 217 bytes https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [15:13:05] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Investigate DB connection issues faced from airflow on an-launcher1002 - https://phabricator.wikimedia.org/T331265 (10nfraison) An idea was that reportupdater job can be the root cause of that high tx Here is the log of al... 
[15:26:16] !log deployed airflow analytics to unbreak druid-load-edit-hourly [15:26:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:28:50] (03CR) 10Ottomata: [C: 03+1] "Thanks! FWIW we should get rid of WikimediaDefaults.WIKIMEDIA_HTTP_CLIENT one day anyway, eh?" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/894642 (https://phabricator.wikimedia.org/T330236) (owner: 10DCausse) [15:29:13] (03PS1) 10Joal: Update oozie webrequest job adding test-cluster version [analytics/refinery] - 10https://gerrit.wikimedia.org/r/894686 [15:37:01] 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10Jelto) [15:38:52] (03CR) 10Ottomata: [C: 03+2] Update oozie webrequest job adding test-cluster version [analytics/refinery] - 10https://gerrit.wikimedia.org/r/894686 (owner: 10Joal) [15:38:54] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Update oozie webrequest job adding test-cluster version [analytics/refinery] - 10https://gerrit.wikimedia.org/r/894686 (owner: 10Joal) [15:42:20] (03CR) 10DCausse: ProduceCanaryEvents: set a timeout on the http client (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/894642 (https://phabricator.wikimedia.org/T330236) (owner: 10DCausse) [15:51:17] btullis: is the fs age critical related to the namenode failover? 
[16:10:41] 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work): Migrate ores_predictions.py from airflow 1 to airflow 2 - https://phabricator.wikimedia.org/T329876 (10pfischer) a:03pfischer [16:10:53] 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work): Migrate transfer_to_es.py from airflow 1 to airflow 2 - https://phabricator.wikimedia.org/T329881 (10pfischer) a:03pfischer [16:14:27] 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work): Migrate rdf_streaming_updater_reconcile.py from airflow 1 to airflow 2 - https://phabricator.wikimedia.org/T329879 (10pfischer) [16:17:29] 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work): Migrate transfer_to_es.py from airflow 1 to airflow 2 - https://phabricator.wikimedia.org/T329881 (10pfischer) Will be done as part of T329876 [16:28:16] (03Abandoned) 10Jennifer Ebe: T329854-Airflow]-Migrate-mediacounts-archive-Oozie-job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/893816 (owner: 10Jennifer Ebe) [16:32:52] 10Data-Engineering, 10Advanced-Search, 10All-and-every-Wikisource, 10ArticlePlaceholder, and 73 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (10MPhamWMF) [16:34:21] (03CR) 10Ottomata: [C: 03+1] ProduceCanaryEvents: set a timeout on the http client (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/894642 (https://phabricator.wikimedia.org/T330236) (owner: 10DCausse) [16:37:30] 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10serviceops: k8s deployment-charts mesh module should allow use of mesh without public_port Service - https://phabricator.wikimedia.org/T326252 (10akosiaris) I guess we can close this one? [16:41:27] ottomata: Yes, apologies for the delay in answering. It's because we're running with the namenode on an-master1002. 
Whichever is the standby node creates the backup, but Icinga is hard-coded to check an-master1002. [16:43:07] ACKNOWLEDGEMENT - Hadoop HDFS Namenode FSImage Age on an-master1002 is CRITICAL: FILE_AGE CRITICAL: /srv/hadoop/name/current/VERSION is 12688 seconds old and 217 bytes Btullis This is not a real problem. It is caused by the fact that we have failed over to the standby namenodes in preparation for T329073 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [16:43:54] ottomata: yes it is linked, the check is only running on the standby puppet role node and doesn't take into account the real state of the namenode (if it is active or standby) while the file checked is only updated on the namenode in standby state [16:43:54] We should update that check: relying on jmx metrics, we should be able to check fsimage age and namenode in standby state (will create a request for it) [16:44:36] nfraison: Great! [16:44:49] btullis: sry, missed your messages while I was writing mine/checking how we do things [16:48:33] nfraison: All good. Two answers are better than one. Mine didn't suggest a fix for it :-) [16:51:50] (HdfsCorruptBlocks) resolved: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [16:51:50] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [16:55:20] 10Data-Engineering-Planning, 10Data Pipelines, 10Patch-For-Review: [Iceberg] Update Refine Sanitize to insert into Iceberg tables - https://phabricator.wikimedia.org/T311739 (10JAllemandou) Problem statement for this use-case here: https://docs.google.com/document/d/1HVO4m8JG5mrYX9ltdvJdt8N3QVscspdNVbpUY2kKt... [17:15:37] (03PS1) 10Kosta Harlan: homepagevisit: Add new referer routes [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/894699 (https://phabricator.wikimedia.org/T328288) [17:24:22] 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10herron) [17:31:50] (HdfsCorruptBlocks) resolved: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [17:32:09] Hey ottomata - how is it going on the test cluster? 
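Editor's sketch of the state-aware FSImage Age check that nfraison proposes at 16:43:54: alert on a stale file only when the namenode being checked is actually in standby, since the file is only refreshed on the standby. This is a plain decision function, not the Icinga check or a JMX client; the function name, the state strings and the threshold used in the usage lines are assumptions for illustration. How the state and age would be read from JMX is left out because the log does not say.

```python
def fsimage_check(ha_state: str, fsimage_age_seconds: float,
                  max_age_seconds: float) -> str:
    """Return 'OK' or 'CRITICAL' for one namenode.

    ha_state            -- 'active' or 'standby' (assumed labels)
    fsimage_age_seconds -- age of the checked file on that host
    max_age_seconds     -- alerting threshold, chosen by the operator
    """
    # The active namenode does not refresh the file, so its age
    # carries no signal there and must not raise an alert.
    if ha_state != "standby":
        return "OK"

    # On the standby, an old file means checkpoints have stopped.
    if fsimage_age_seconds > max_age_seconds:
        return "CRITICAL"
    return "OK"


if __name__ == "__main__":
    # 12688 seconds is the age from the acknowledged alert above;
    # the 7200-second threshold is a hypothetical value.
    print(fsimage_check("active", 12688, 7200))
    print(fsimage_check("standby", 12688, 7200))
```

With a check shaped like this, the alert acknowledged at 16:43:07 would not have fired after the failover, while a genuinely stalled standby would still be caught.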
[17:33:16] (03PS2) 10Jennifer Ebe: Create_mediacounts_archive_hql [analytics/refinery] - 10https://gerrit.wikimedia.org/r/891621 [17:35:23] 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10ssingh) [18:39:55] (03CR) 10DCausse: ProduceCanaryEvents: set a timeout on the http client (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/894642 (https://phabricator.wikimedia.org/T330236) (owner: 10DCausse) [18:50:51] 10Data-Engineering, 10Data Pipelines: Install conda-analytics on Airflow servers - https://phabricator.wikimedia.org/T331345 (10xcollazo) [18:55:10] 10Data-Engineering, 10Equity-Landscape: Data structuring guidance request - https://phabricator.wikimedia.org/T287402 (10JAnstee_WMF) 05Open→03Resolved [19:14:42] 10Data-Engineering, 10Data Pipelines: Install conda-analytics on Airflow servers - https://phabricator.wikimedia.org/T331345 (10Ottomata) @BTullis @mforns is there any reason we should not do this? I think it's fine and probably the easiest solution. Other alternatives: - install analytics-platform-eng user... [19:20:51] (03CR) 10Ottomata: [C: 03+1] ProduceCanaryEvents: set a timeout on the http client (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/894642 (https://phabricator.wikimedia.org/T330236) (owner: 10DCausse) [19:21:59] 10Data-Engineering, 10Equity-Landscape: Load country data - https://phabricator.wikimedia.org/T310712 (10mpopov) @ntsako @JAnstee_WMF: Where can I find the final version of the dataset? How can this task be resolved when the dependency T318850 is still open? [19:35:20] (03CR) 10Milimetric: [C: 03+2] "Looks good to me, +2-ing and letting you merge when you're ready." 
[analytics/refinery] - 10https://gerrit.wikimedia.org/r/889512 (owner: 10Nmaphophe) [19:39:11] 10Analytics, 10Data-Engineering-Icebox, 10AQS 2.0 Roadmap, 10API Platform (API Platform Roadmap), and 3 others: AQS 2.0 documentation - https://phabricator.wikimedia.org/T288664 (10apaskulin) [20:02:05] 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10herron) [20:08:34] (03PS2) 10Kosta Harlan: homepagevisit: Add new referer routes [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/894699 (https://phabricator.wikimedia.org/T322435) [20:14:28] PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:17:08] (03PS1) 10Joal: Fix oozie/webrequest/dataset_raw.xml [analytics/refinery] - 10https://gerrit.wikimedia.org/r/894726 [20:17:47] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging to fix test cluster" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/894726 (owner: 10Joal) [20:18:08] ottomata: I have fixed and restarted the test-cluster webrequest job [20:36:04] RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:01:04] joal: okay i will look, thank you! 
[21:08:34] PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:41:10] RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:05:52] and btullis nfraison, analytics test cluster has an-test-master1002 as active namenode, is that intended too? [22:06:13] was that recently switched? [22:06:13] i see some newly scheduled (since joal's fix) oozie jobs that failed there. [22:08:46] 10Data-Engineering, 10AQS 2.0 Roadmap, 10API Platform (API Platform Roadmap), 10Epic, and 2 others: AQS 2.0: Geo Analytics Service - https://phabricator.wikimedia.org/T288305 (10BPirkle) [22:09:59] joal: i see the oozie job running, it is catching up [22:10:06] i'll check the aqs_hourly job tomorrow [22:13:32] PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:35:14] RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:12:44] !log deployed airflow analytics to unbreak druid-load-edit-hourly [23:12:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [23:17:22] 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work): Migrate popularity_score.py from airflow 1 to airflow 2 - 
https://phabricator.wikimedia.org/T329877 (10EBernhardson) a:03EBernhardson [23:19:53] 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=786ee8c7-4753-4e2d-96f9-8b55b691ff09) set by bking@cumin2002 for... [23:20:58] 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f9f1bd07-4af1-41e3-82b7-3ab0f2ff8672) set by bking@cumin2002 for... [23:22:23] 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10bking) [23:25:14] 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10RKemper)