[08:06:08] Data-Engineering-Planning, Data Pipelines, Foundational Technology Requests, Traffic, and 2 others: Add a webrequest sampled topic and ingest into druid/turnilo - https://phabricator.wikimedia.org/T314981 (elukey) Open→Resolved a:elukey Closing the task since nobody opposed to my earl...
[08:20:18] (CR) Joal: [C: +2] "Merging for deploy this week" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/856530 (https://phabricator.wikimedia.org/T321168) (owner: Aqu)
[08:29:27] (Merged) jenkins-bot: Put wikihadoop into refinery/source [analytics/refinery/source] - https://gerrit.wikimedia.org/r/856530 (https://phabricator.wikimedia.org/T321168) (owner: Aqu)
[08:31:54] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05): Flink SQL queries should access Kafka topics from a Catalog - https://phabricator.wikimedia.org/T322022 (tchin) [[ https://gitlab.wikimedia.org/tchin/flink-wmf-event-catalog | Here's the working code so far]], sans the stuff I talk about...
[09:32:00] (CR) Joal: "Again many comments :S Let's talk about them in our 1-1" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/852315 (https://phabricator.wikimedia.org/T321168) (owner: Aqu)
[09:36:33] Data-Engineering, Equity-Landscape: Load country data - https://phabricator.wikimedia.org/T310712 (ntsako) a:JAnstee_WMF→ntsako
[09:41:59] (PS2) Nmaphophe: GDI Equity Landscape Tables/Scripts [analytics/refinery] - https://gerrit.wikimedia.org/r/858370
[09:44:36] (PS3) Nmaphophe: GDI Equity Landscape Tables/Scripts [analytics/refinery] - https://gerrit.wikimedia.org/r/858370
[09:47:14] (PS4) Nmaphophe: GDI Equity Landscape Tables/Scripts [analytics/refinery] - https://gerrit.wikimedia.org/r/858370
[09:52:53] (CR) Joal: [V: +2 C: +2] "Merging for next deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/861365 (owner: Volans)
[10:50:31] Data-Engineering, Event-Platform Value Stream (Sprint 05): Deploy Mediawiki Stream Enrichment on an-launcher1002. - https://phabricator.wikimedia.org/T323914 (gmodena) The cluster and mediawiki stream enrichment job are running at https://yarn.wikimedia.org/proxy/application_1663082229270_434209/#/task-m...
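For reference on the T323914 message above: a minimal sketch of how the YARN application hosting the enrichment job could be checked from a client host, assuming the standard Hadoop YARN CLI is available; the application id is taken from the yarn.wikimedia.org proxy URL quoted in the log.

```bash
# Minimal sketch (assumes a stat/launcher host with the standard YARN CLI configured).
# The application id comes from the yarn.wikimedia.org proxy URL quoted above (T323914).
yarn application -status application_1663082229270_434209

# Or list running applications and filter for the enrichment job by name:
yarn application -list -appStates RUNNING | grep -i enrich
```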
[11:04:28] Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Upgrade Turnilo - https://phabricator.wikimedia.org/T301990 (EChetty)
[11:04:31] Data-Engineering, Data-Engineering-Kanban, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Fix turnilo after upgrade - https://phabricator.wikimedia.org/T308778 (EChetty) Open→Resolved
[13:48:53] Data-Engineering-Planning, Data Pipelines, Shared-Data-Infrastructure: [Iceberg] Debianize and install iceberg support for Spark, Presto, and optionally Hive - https://phabricator.wikimedia.org/T311738 (EChetty) Hive is out of scope for this task
[14:13:20] Data-Engineering-Planning, Data Pipelines, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): [Iceberg] Debianize and install iceberg support for Spark, Presto, and optionally Hive - https://phabricator.wikimedia.org/T311738 (EChetty)
[14:14:13] Data-Engineering-Planning, Shared-Data-Infrastructure: NEW FEATURE REQUEST: Upgrade superset to 1.5.2 - https://phabricator.wikimedia.org/T323458 (EChetty)
[14:14:28] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): NEW FEATURE REQUEST: Upgrade superset to 1.5.2 - https://phabricator.wikimedia.org/T323458 (EChetty)
[14:16:56] Data-Engineering-Planning, Shared-Data-Infrastructure: Update varnishkafka client certificate for authenticating to kafka-jumbo - https://phabricator.wikimedia.org/T323771 (EChetty)
[14:17:04] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Update varnishkafka client certificate for authenticating to kafka-jumbo - https://phabricator.wikimedia.org/T323771 (EChetty)
[14:19:47] Data-Engineering-Planning, Data Pipelines, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): [Iceberg] Debianize and install iceberg support for Spark, Presto, and optionally Hive - https://phabricator.wikimedia.org/T311738 (EChetty)
[14:47:30] a-team: I am about to shut down 7 an-worker nodes in order for them to receive new RAID controller batteries. Re: T318659
[14:47:31] T318659: Multiple RAID battery failures on hadoop worker hosts - https://phabricator.wikimedia.org/T318659
[14:47:43] nice
[14:48:37] One of these seven nodes is a journalnode, but four of the five journalnodes will remain up and running, so I do not anticipate a problem with this. Still, I'll shut down an-worker1090 first and check that everything is fine.
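For context on the journalnode concern above: a minimal sketch, under stated assumptions, of how the JournalNode list and quorum could be verified before taking nodes down. The loop hostnames are placeholders (only an-worker1090 is named in the log as the affected JournalNode), the ports are Hadoop defaults, and the authoritative host list lives in the cluster configuration.

```bash
# Sketch only: confirm which hosts are JournalNodes and that a quorum (3 of 5) stays up.
# The qjournal URI in dfs.namenode.shared.edits.dir lists the JournalNode hosts.
hdfs getconf -confKey dfs.namenode.shared.edits.dir

# Probe each JournalNode's RPC port (Hadoop default 8485); hostnames below are placeholders.
for jn in an-worker1090 journalnode-host-2 journalnode-host-3 journalnode-host-4 journalnode-host-5; do
  nc -z -w 2 "$jn" 8485 && echo "$jn up" || echo "$jn down"
done
```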
[14:49:08] Ref: an-worker1090
[14:49:18] https://github.com/wikimedia/puppet/blob/production/hieradata/common.yaml#L911-L916
[14:55:55] !log shutting down an-worker1090
[14:55:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:58:42] !log shutting down an-worker1079
[14:58:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:00:54] !log shutting down an-worker1083
[15:00:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:02:00] !log shutting down an-worker1085
[15:02:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:03:33] !log shutting down an-worker1089
[15:03:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:04:54] !log shutting down an-worker1093
[15:04:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:05:37] PROBLEM - Host an-worker1089 is DOWN: PING CRITICAL - Packet loss = 100%
[15:09:03] ACKNOWLEDGEMENT - Host an-worker1079 is DOWN: PING CRITICAL - Packet loss = 100% Btullis Intentional downtime for RAID battery replacement
[15:09:03] ACKNOWLEDGEMENT - Host an-worker1083 is DOWN: PING CRITICAL - Packet loss = 100% Btullis Intentional downtime for RAID battery replacement
[15:09:03] ACKNOWLEDGEMENT - Host an-worker1085 is DOWN: PING CRITICAL - Packet loss = 100% Btullis Intentional downtime for RAID battery replacement
[15:09:04] ACKNOWLEDGEMENT - Host an-worker1089 is DOWN: PING CRITICAL - Packet loss = 100% Btullis Intentional downtime for RAID battery replacement
[15:09:05] ACKNOWLEDGEMENT - Host an-worker1090 is DOWN: PING CRITICAL - Packet loss = 100% Btullis Intentional downtime for RAID battery replacement
[15:09:06] ACKNOWLEDGEMENT - Host an-worker1093 is DOWN: PING CRITICAL - Packet loss = 100% Btullis Intentional downtime for RAID battery replacement
[15:09:07] ACKNOWLEDGEMENT - Host an-worker1094 is DOWN: PING CRITICAL - Packet loss = 100% Btullis Intentional downtime for RAID battery replacement
[15:19:50] (HdfsMissingBlocks) firing: HDFS missing blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_missing_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=40&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsMissingBlocks
[15:20:07] ^looking.
[15:26:11] PROBLEM - Host an-worker1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:26:29] This may mean that there are some files with all three replicas on these datanodes. I'm running `sudo -u hdfs kerberos-run-command hdfs hdfs fsck /|grep '0 live replica'` to try to find any such files.
[15:44:21] PROBLEM - Host an-worker1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:44:24] Data-Engineering-Planning, Cassandra, Data Pipelines (Sprint 04), Patch-For-Review: Write dedicated cassandra authorization code to read password from file when loading - https://phabricator.wikimedia.org/T306895 (Eevans) >>! In T306895#8408387, @JAllemandou wrote: > Thank you @BTullis and @Ottom...
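The fsck check quoted at 15:26:29 can be expanded slightly. A sketch, using the same WMF kerberos-run-command wrapper shown in the log and standard hdfs fsck flags; the single-file path in the last command is illustrative only:

```bash
# Find paths whose blocks currently have zero live replicas (all replicas on the downed nodes),
# as in the command quoted above:
sudo -u hdfs kerberos-run-command hdfs hdfs fsck / | grep '0 live replica'

# Summary of missing/corrupt/under-replicated block counts:
sudo -u hdfs kerberos-run-command hdfs hdfs fsck / | tail -n 30

# For a single suspect file, show which datanodes hold its replicas (path is illustrative):
sudo -u hdfs kerberos-run-command hdfs hdfs fsck /wmf/data/some/path -files -blocks -locations
```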
[15:44:39] RECOVERY - Host an-worker1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms
[15:51:56] Data-Engineering-Planning, Cassandra, Data Pipelines (Sprint 04), Patch-For-Review: Write dedicated cassandra authorization code to read password from file when loading - https://phabricator.wikimedia.org/T306895 (BTullis) Thanks for reminding me @Eevans - Yes, it's this ticket: {T323692}
[15:53:37] Data-Engineering-Planning, Cassandra, Data Pipelines (Sprint 04), Patch-For-Review: Write dedicated cassandra authorization code to read password from file when loading - https://phabricator.wikimedia.org/T306895 (Eevans) >>! In T306895#8428977, @BTullis wrote: > Thanks for reminding me @Eevans -...
[15:55:00] Data-Engineering-Planning, Cassandra: Create puppet defined type for adding/updating/deleting secrets or other small files on HDFS - https://phabricator.wikimedia.org/T323692 (Eevans)
[15:56:31] PROBLEM - Host an-worker1085.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:56:39] RECOVERY - Host an-worker1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms
[16:02:02] hmm, i'm getting missing blocks from hdfs, known issue? `hdfs dfs -cat /wmf/data/discovery/query_clicks/daily/year=2022/month=9/day=13/000011_0 > /dev/null` gives Could not obtain BP-1552854784-10.64.21.110-1405114489661:blk_2038833274_965173858 from any node: No live nodes contain current block Block locations: Dead nodes: .
[16:02:33] PROBLEM - Host an-worker1083.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:02:37] RECOVERY - Host an-worker1085.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms
[16:02:40] oh, i see that mentioned an hour ago, it's known :)
[16:03:38] ebernhardson: Ah, yes, sorry - it's related to work currently being coordinated between myself and dc-ops. I should have scheduled the servers to be handled one at a time instead of seven at once. Should be fixed soon.
[16:04:39] kk, i'll try it again in a couple hours. no big deal
[16:06:07] Thanks. You should be able to see when it is resolved from this graph: https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&var-hadoop_cluster=analytics-hadoop&viewPanel=40&from=now-15m&to=now&refresh=5m
[16:08:39] RECOVERY - Host an-worker1083.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms
[16:14:43] PROBLEM - Host an-worker1079.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:17:17] RECOVERY - Host an-worker1079.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms
[16:20:49] PROBLEM - Host an-worker1093.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:26:54] RECOVERY - Host an-worker1093.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms
[16:27:01] PROBLEM - Host an-worker1094.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:31:15] Data-Engineering-Planning: Bug/Incident Report [TEMPLATE] - https://phabricator.wikimedia.org/T320633 (Aklapper) >>! In T320633#8322033, @EChetty wrote: > Is it still necessary to keep the tag even if it is used as example. This is specifically for motivating in T320648 @EChetty: Hi, I'm not sure I understa...
[16:39:19] RECOVERY - Host an-worker1094.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms
[17:07:31] RECOVERY - Host an-worker1089 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[17:08:52] !log booted all of the an-worker nodes that had been switched off.
[17:08:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:12:28] !log deploying refinery, then restarting druid webrequest daily and hourly loading oozie jobs
[17:12:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:21:17] ebernhardson: Zero missing blocks. You are good to go again if you wish. Apologies for the inconvenience.
[17:21:20] (HdfsMissingBlocks) resolved: HDFS missing blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_missing_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=40&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsMissingBlocks
[17:22:42] thanks btullis for handling this
[17:28:22] Data-Engineering-Planning: requesting Kerberos password for mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313316 (Ottomata) In progress→Resolved
[17:46:01] btullis: thanks!
[18:04:53] Data-Engineering-Planning: Check home/HDFS leftovers of eyener - https://phabricator.wikimedia.org/T316072 (Ottomata) Hi @jrobell1, the following files are leftover in eyener's home directories on the stat boxes. Do you approve their removal? We can archive things that need to be kept, but we'd prefer to r...
[18:07:13] Data-Engineering-Planning: Check home/HDFS leftovers of dpifke - https://phabricator.wikimedia.org/T315841 (Ottomata) Open→Resolved a:Ottomata There are no leftover data files owned by dpifke in the analytics cluster. ` 18:06:14 [@an-launcher1002:/home/otto] $ sudo -u hdfs kerberos-run-command h...
[18:08:29] Data-Engineering-Planning: Check home/HDFS leftovers of ejoseph - https://phabricator.wikimedia.org/T322182 (Ottomata) Open→Resolved a:Ottomata There are no leftover data files owned by ejoseph in the analytics cluster. ` 18:06:23 [@an-launcher1002:/home/otto] $ sudo -u hdfs kerberos-run-command...
[18:09:32] Data-Engineering-Planning: Check home/HDFS leftovers of faidon - https://phabricator.wikimedia.org/T322107 (Ottomata) @mark, please approve for removal of the following files: ` ====== stat1004 ====== total 572 -rw-rw-r-- 1 2186 wikidev 602 Dec 19 2018 asn-rank.py -rw-rw-r-- 1 2186 wikidev 566132 Dec 19...
[18:11:20] Data-Engineering-Planning: Check home/HDFS leftovers of bscarone - https://phabricator.wikimedia.org/T321542 (Ottomata) Alright! Leaving this task open for now then. @Miriam, will bscarone be coming back to work, or will the files be used and owned by a new person? If the former, we can just wait until h...
[18:14:52] Data-Engineering-Planning: Check home/HDFS leftovers of bmansurov - https://phabricator.wikimedia.org/T320367 (Ottomata) @Miriam, please approve for removal of the following files and Hive tables. We can [[ https://wikitech.wikimedia.org/wiki/Data_Engineering/Ops_week#Archival_of_user_files | archive ]] the...
[18:16:06] Data-Engineering-Planning: Check home/HDFS leftovers of nikafor - https://phabricator.wikimedia.org/T319268 (Ottomata) Open→Resolved a:Ottomata There are no leftover data files owned by nikafor in the analytics cluster. nikafor's hdfs and regular homedirs have already been removed.
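On the "Check home/HDFS leftovers" tasks above and below: a sketch of the kind of per-user check behind them, assuming a stat box plus the kerberos wrapper shown in the task comments; the username and search paths are placeholders.

```bash
USER=exampleuser   # placeholder, not a real account

# Local home directory on a stat box:
ls -la "/home/${USER}" 2>/dev/null

# Any files still owned by the user under /home or /srv (paths are illustrative):
sudo find /home /srv -user "${USER}" -ls 2>/dev/null

# HDFS home directory, via the same wrapper quoted in the task comments:
sudo -u hdfs kerberos-run-command hdfs hdfs dfs -ls "/user/${USER}"
```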
[18:17:22] Data-Engineering-Planning: Check home/HDFS leftovers of jmads - https://phabricator.wikimedia.org/T319266 (Ottomata) @Dendelele, please approve the removal of the following files: ` ====== stat1005 ====== total 8 drwxrwxr-x 5 24076 wikidev 4096 Aug 7 20:34 datasets drwxrwxr-x 6 24076 wikidev 4096 Aug 4 15...
[18:18:36] Data-Engineering-Planning: Check home/HDFS leftovers of bscarone - https://phabricator.wikimedia.org/T321542 (Miriam) @Ottomata thank you! bscarone will resume his work in February, so keeping the data in his home would be the best option if that's possible!
[18:20:43] Data-Engineering-Planning: Check home/HDFS leftovers of bscarone - https://phabricator.wikimedia.org/T321542 (Ottomata) Open→Declined +1, sounds good. Let's decline this task then. We can reopen or make a new one if/when bscarone leaves again :)
[18:23:20] Data-Engineering-Planning: Check home/HDFS leftovers of bmansurov - https://phabricator.wikimedia.org/T320367 (Miriam) @Ottomata thanks for this! Can we please temporarily archive these files and hive tables? We will evaluate what can be permanently removed once this stream of work resumes in February. Thank...
[18:27:15] Data-Engineering-Planning: Check home/HDFS leftovers of bmansurov - https://phabricator.wikimedia.org/T320367 (Ottomata) It might be easier to either leave these files in place and revisit this in February, or to have them moved under your ownership. When we archive, we zip everything up and put it in HDFS....
[19:52:09] Data-Engineering, Product-Analytics, Wmfdata-Python: Remove Matplotlib as a Wmfdata-Python dependency - https://phabricator.wikimedia.org/T324053 (nshahquinn-wmf)
[20:01:10] Data-Engineering, Product-Analytics, Wmfdata-Python: Remove Matplotlib as a Wmfdata-Python dependency - https://phabricator.wikimedia.org/T324053 (nshahquinn-wmf) p:Triage→Low For the most part, the dependency doesn't matter. However, it does slightly expand the size of the `conda-analytics`...
[20:36:55] Analytics-Kanban, Data-Engineering, Product-Analytics, SRE, Wmfdata-Python: wmfdata.mariadb relies on analytics-mysql being available - https://phabricator.wikimedia.org/T292479 (nshahquinn-wmf)
[20:37:29] Analytics-Kanban, Data-Engineering, Product-Analytics, SRE, Wmfdata-Python: wmfdata.mariadb relies on analytics-mysql being available - https://phabricator.wikimedia.org/T292479 (nshahquinn-wmf) Updated the description to note: > In addition, analytics-mysql is not available on an-test-client...
[20:42:45] Analytics-Radar, Data-Engineering-Planning, Pageviews-API, Tool-Pageviews: 429 Too Many Requests hit despite throttling to 100 req/sec - https://phabricator.wikimedia.org/T219857 (VirginiaPoundstone)
[20:49:08] Data-Engineering-Planning, API Platform, GraphQL, Pageviews-API: Responses on pageview API should be lighter - https://phabricator.wikimedia.org/T145935 (VirginiaPoundstone)
[20:49:11] Data-Engineering, AQS 2.0 Roadmap, API Platform (API Platform Roadmap), Epic, and 2 others: AQS 2.0: Pageviews Service - https://phabricator.wikimedia.org/T288296 (VirginiaPoundstone)
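On T324053 (removing Matplotlib as a Wmfdata-Python dependency): a rough sketch of how the dependency edge could be inspected inside the conda-analytics environment mentioned in the task; the package name "wmfdata" is assumed from the project name and may differ.

```bash
# Assumes the wmfdata-python package is installed as "wmfdata" in the active conda-analytics env.
pip show wmfdata | grep '^Requires'          # does wmfdata still declare matplotlib?
pip show matplotlib | grep '^Required-by'    # which installed packages pull matplotlib in?
```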