[01:34:55] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 10): EventStreamCatalog removes 'topic' table option if connector = upsert-kafka - https://phabricator.wikimedia.org/T330769 (tchin) a:tchin
[02:15:07] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 10): Flink EventStreamCatalog should add watermark - https://phabricator.wikimedia.org/T330441 (tchin) a:tchin
[08:32:49] Data-Engineering-Planning, DBA, Data Pipelines, Infrastructure-Foundations, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (fgiunchedi)
[08:33:03] (PS1) Joal: Add referer_data field to the pageview_actor table [analytics/refinery] - https://gerrit.wikimedia.org/r/898679 (https://phabricator.wikimedia.org/T331898)
[08:49:56] Data-Engineering-Planning, DBA, Data Pipelines, Infrastructure-Foundations, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Vgutierrez)
[09:04:35] (CR) Joal: [C: +1] "Nits in comments - code looks good" [analytics/refinery] - https://gerrit.wikimedia.org/r/894661 (https://phabricator.wikimedia.org/T327073) (owner: Aqu)
[09:28:10] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (elukey)
[09:31:33] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (Gehel) In progress→Resolved
[09:31:37] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): [Tracking] Migrate Search Airflow jobs to Airflow 2 and use shared supporting code from the data engineering Airflow - https://phabricator.wikimedia.org/T318414 (Gehel)
[09:34:36] (CR) Joal: [C: +1] Migrate refine webrequest to Airflow (1 comment) [analytics/refinery] -
https://gerrit.wikimedia.org/r/894661 (https://phabricator.wikimedia.org/T327073) (owner: Aqu)
[09:46:27] Data-Engineering, Machine-Learning-Team, Research, Event-Platform Value Stream (Sprint 10): Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (Ottomata) > it's like the mediawiki/revision/score schema can be used by many streams e....
[10:00:41] Data-Engineering-Planning, DBA, Data Pipelines, Infrastructure-Foundations, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (elukey)
[10:23:39] !log deploying airflow package version 2.5.1-py3.10-20230228 to stats hosts
[10:23:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:26:02] btullis, ottomata o/ - any plan for https://phabricator.wikimedia.org/T296064 during the next weeks?
[10:26:19] (I was reviewing old tasks and this one is getting its own age :D)
[10:30:09] elukey: Yes, OK. We're having a sprint planning meeting today. I'll try to get it into this sprint, although there are plenty of other old tickets. I'm going to try to do this one in the next 7 days :-) T284150
[10:30:10] T284150: Refactor analytics-meta MariaDB layout to use an-mariadb100[12] - https://phabricator.wikimedia.org/T284150
[10:31:43] btullis: o/ nono I didn't mean in the next week, it is ok even in the next months :)
[10:32:16] I can prep some code reviews if you want, I think that the main question mark is to see if there are clients not ported to the new ca bundle
[10:32:37] in theory no, but not sure if we added more consumers to jumbo recently
[10:37:07] elukey: ack, thanks. We also have 6 new servers to add to kafka-jumbo and 6 servers to decom, so that will take some time to do the reshuffling and rebalancing.
[10:39:33] ah snap yes
[10:41:35] elukey: I know that we (you) have added benthos, but that looks to be using the new bundle already, right?
Then we have the current experiments with flink, I can check those. Not aware of any other new consumers, but thanks for the heads-up.
[10:42:08] OK, I won't try to squeeze it into this sprint, but we have Q4 planning very soon and I'll make sure it goes on the agenda for that.
[10:43:55] yes yes benthos is using the new bundle, all good from that side
[10:44:00] good point about flink
[10:46:58] ahem I am seeing port 9092 (plaintext) used for flink mw enrichment
[10:47:10] buuuuu :)
[10:52:17] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/895754
[10:53:49] Then linked to T331526 - So it looks to be very temporary use of 9092 - ottomata is already working on it :-)
[10:53:49] T331526: eventutilities-python should support using Kafka TLS ports - https://phabricator.wikimedia.org/T331526
[10:57:19] Data-Engineering, MediaWiki-extensions-EventLogging, Metrics-Platform-Planning (Metrics Platform Kanban): Value for performer.registration_dt should be a string, not an integer - https://phabricator.wikimedia.org/T331972 (phuedx)
[10:59:49] elukey: btullis re enrichment kafka tls: https://phabricator.wikimedia.org/T331526
[11:00:25] oh you found it oops!
[11:00:41] ottomata: All good :-)
[11:00:50] ottomata: ahahah I was joking, nice that the ca bundle is already covered in there, I think that we are good
[11:04:24] I wonder if I could get an opinion please. I am looking at these anomaly detection emails that we have been receiving for the last few days.
[11:05:43] They say lots of drops in traffic from Sudan, Somalia, Tajikistan, Burundi, Uzbekistan - but when I look at the anomaly detection dashboard for traffic distribution it also shows traffic from Singapore to have dropped off completely.
https://superset.wikimedia.org/superset/dashboard/315/
[11:07:06] Ops week docs say that 'traffic team will investigate further': https://wikitech.wikimedia.org/wiki/Data_Engineering/Ops_week#Anomaly_detection_alarms but is there anything I should do, other than make sure they are aware of it?
[11:11:33] IIRC not really, the data was needed by the Traffic team, in theory they should reach back if anything looks weird/suspicious etc.. (in the sense that it looks odd and not in line with other sources)
[11:48:12] !log reran refine_event_sanitized_analytics_immediate for netflow year=2023/month=3/day=8/hour=6
[11:48:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:56:19] (CR) Luke Bowmaker: Add referer_data field to the pageview_actor table (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/898679 (https://phabricator.wikimedia.org/T331898) (owner: Joal)
[13:16:02] Data-Engineering, CheckUser, MW-1.38-notes (1.38.0-wmf.26; 2022-03-14), MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), and 4 others: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 (Marostegui)
[13:56:01] Analytics-Radar, Data-Engineering, Data-Engineering-Wikistats, Diffusion-Repository-Administrators, and 3 others: Archive analytics/wikistats - https://phabricator.wikimedia.org/T332004 (hashar)
[13:59:53] Data-Engineering, AQS 2.0 Roadmap, API Platform (API Platform Roadmap), Epic, and 2 others: Create k8s deployment of AQS 2.0 - https://phabricator.wikimedia.org/T288661 (VirginiaPoundstone)
[14:00:48] Analytics-Radar, Data-Engineering, Data-Engineering-Wikistats, Diffusion-Repository-Administrators, and 4 others: Archive analytics/wikistats - https://phabricator.wikimedia.org/T332004 (hashar)
[14:21:21] (CR) Joal: Add referer_data field to the pageview_actor table (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/898679
(https://phabricator.wikimedia.org/T331898) (owner: Joal)
[14:25:04] (CR) Luke Bowmaker: Add referer_data field to the pageview_actor table (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/898679 (https://phabricator.wikimedia.org/T331898) (owner: Joal)
[14:43:01] Data-Engineering, AQS2.0, API Platform (Sprint 06), Platform Engineering Roadmap, User-Eevans: AQS 2.0: Pageviews: Implement Unit Tests - https://phabricator.wikimedia.org/T299735 (JArguello-WMF)
[14:57:22] !log deploying ceph mon and mgr daemons to cephosd100[1-5] T328123
[14:57:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:57:25] T328123: Create puppet profiles for the new ceph cluster - https://phabricator.wikimedia.org/T328123
[15:01:23] (CR) Milimetric: [V: +2 C: +2] Add referer_data field to the pageview_actor table [analytics/refinery] - https://gerrit.wikimedia.org/r/898679 (https://phabricator.wikimedia.org/T331898) (owner: Joal)
[15:04:37] (CR) Milimetric: [V: +2 C: +2] "(we're going to turn off this job right now)" [analytics/refinery] - https://gerrit.wikimedia.org/r/804429 (owner: Milimetric)
[15:07:54] (PS1) Milimetric: Remove officewiki job (was an experiment) [analytics/refinery] - https://gerrit.wikimedia.org/r/898778
[15:11:27] (CR) Milimetric: [V: +2 C: +2] "dropping this job (already removed from airflow and turned off in prod)" [analytics/refinery] - https://gerrit.wikimedia.org/r/898778 (owner: Milimetric)
[15:15:08] (CR) Milimetric: [V: +2 C: +2] "I forgot to send these comments... 3 years ago...
oops" [analytics/wikistats2] - https://gerrit.wikimedia.org/r/612574 (https://phabricator.wikimedia.org/T255757) (owner: Fdans)
[15:30:00] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (herron)
[15:31:59] hey a-team: It looks like I've made stat1006 unresponsive by eating up all its memory. Sorry! If someone could nudge it back by either stopping all my R processes or something, that would be awesome
[15:40:04] Analytics-Radar, Data-Engineering, Data-Engineering-Wikistats, Diffusion-Repository-Administrators, and 4 others: Archive analytics/wikistats - https://phabricator.wikimedia.org/T332004 (Milimetric) sorry we left it open so long. I just have to check with Nemo, I'll reply back within a few days.
[15:40:33] Nettrom: I can have a look at that for you.
[15:43:01] Nettrom: Oh, it looks like the oom-killer already did this. The host is back under control now.
[15:43:05] https://www.irccloud.com/pastebin/8SQrTVEN/
[15:44:04] btullis: thanks for checking on that, I'm happy to hear that the oom-killer took care of it!
[15:44:33] nice
[15:47:09] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work), Patch-For-Review: Migrate glent_weekly.py from airflow 1 to airflow 2 - https://phabricator.wikimedia.org/T329872 (pfischer) a:pfischer
[15:48:37] btullis: out of curiosity shouldn't Nettrom's memory gobble show up as a spike in the memory utilization on https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-site=eqiad&var-cluster=analytics&var-instance=stat1006&from=now-1h&to=now?
[15:53:59] bearloga: I think if you look at this, you can see that whatever happened it was brutal enough to knock out the communication to the prometheus node exporter. There are big gaps in the metrics, so the numbers won't be available on these graphs that show summary data across lots of hosts.
[15:53:59] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=stat1006&var-datasource=thanos&var-cluster=analytics&from=now-1h&to=now
[15:55:02] whoa!!! that's fascinating – thanks!
[15:55:30] Data-Engineering-Planning, Shared-Data-Infrastructure: Make YARN web interface work with both primary and standby resourcemanager - https://phabricator.wikimedia.org/T331448 (JArguello-WMF)
[15:57:12] I'm afraid I'm in the middle of something, so I haven't gone further back through the logs yet.
[15:58:08] I suspect that it might have started trying to swap to disk and that disk I/O saturated the server.
[15:58:55] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (JArguello-WMF) In progress→Resolved
[15:58:57] Data-Engineering-Planning, Shared-Data-Infrastructure, Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (JArguello-WMF)
[15:59:06] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade the Data Engineering team's Zookeeper servers to Bullseye - https://phabricator.wikimedia.org/T329362 (JArguello-WMF) Open→Resolved
[15:59:09] Data-Engineering-Planning, Shared-Data-Infrastructure, Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (JArguello-WMF)
[15:59:12] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Investigate slownesses on an-worker1132 - https://phabricator.wikimedia.org/T330979 (JArguello-WMF) Open→Resolved
[16:04:25] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Create puppet profiles for the new ceph cluster - https://phabricator.wikimedia.org/T328123 (ops-monitoring-bot) Icinga downtime and Alertmanager silence
(ID=d004b3da-f6b2-44bc-994d-9e4ff6dc6413) set by btullis@cumin1001...
[16:04:58] Data-Engineering-Planning, Shared-Data-Infrastructure, Epic: Install Ceph Cluster for Data Engineering - https://phabricator.wikimedia.org/T324660 (BTullis)
[16:05:00] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Create puppet profiles for the new ceph cluster - https://phabricator.wikimedia.org/T328123 (BTullis) Open→Resolved
[16:07:26] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Deploy ceph mon processes to data-engineering cluster - https://phabricator.wikimedia.org/T330149 (BTullis) p:Triage→High We have deployed the profiles in {T328123} so this is in progress while we bootstrap the c...
[16:12:06] Data-Engineering-Planning, Shared-Data-Infrastructure: Add optional TLS encryption to the druid-public-broker - https://phabricator.wikimedia.org/T331631 (BTullis)
[16:18:26] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work), Patch-For-Review: Migrate glent_weekly.py from airflow 1 to airflow 2 - https://phabricator.wikimedia.org/T329872 (pfischer) a:pfischer→EBernhardson
[16:24:07] Data-Engineering, MediaWiki-extensions-EventLogging, Metrics-Platform-Planning (Metrics Platform Kanban), Patch-For-Review: Value for performer.registration_dt should be a string, not an integer - https://phabricator.wikimedia.org/T331972 (phuedx)
[16:24:31] (Abandoned) Sharvaniharan: Remove android.image_recommendation_interaction schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/708189 (owner: Sharvaniharan)
[17:13:19] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Investigate trend of gradual hive server heap exhaustion - https://phabricator.wikimedia.org/T303168 (DAbad) **Mar 14, 2023 Sprint Planning Notes:** - Recommend keeping in next sprint so we can keep an eye on this
[17:17:13] Data-Engineering-Planning, Data-Catalog, Shared-Data-Infrastructure: Datahub user records are not being created after login - https://phabricator.wikimedia.org/T327884 (JArguello-WMF)
[17:27:25] Data-Engineering-Planning, Data Pipelines, Infrastructure-Foundations, Shared-Data-Infrastructure: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (JArguello-WMF)
[17:27:58] Data-Engineering-Planning, Data Pipelines, Infrastructure-Foundations, Shared-Data-Infrastructure: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (JArguello-WMF)
[17:28:09] Data-Engineering-Planning, Data Pipelines, Infrastructure-Foundations, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (JArguello-WMF)
[17:52:09] Data-Engineering, Advanced-Search, All-and-every-Wikisource, ArticlePlaceholder, and 72 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (Tchanders) Thanks - untagging IPInfo since the value is different from the default.
[17:52:31] Data-Engineering, CheckUser, MW-1.38-notes (1.38.0-wmf.26; 2022-03-14), MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), and 4 others: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 (Izno)
[17:53:38] Data-Engineering-Planning, Shared-Data-Infrastructure: Refactor analytics-meta MariaDB layout to use an-mariadb100[12] - https://phabricator.wikimedia.org/T284150 (JArguello-WMF) p:Medium→High
[17:53:58] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Refactor analytics-meta MariaDB layout to use an-mariadb100[12] - https://phabricator.wikimedia.org/T284150 (JArguello-WMF)
[18:00:37] Data-Engineering-Planning, Data Pipelines, Infrastructure-Foundations, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (JArguello-WMF) p:Medium→Low
[18:01:40] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade db1108 to Bullseye - https://phabricator.wikimedia.org/T304492 (JArguello-WMF) p:Triage→Medium
[18:26:13] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade db1108 to Bullseye - https://phabricator.wikimedia.org/T304492 (Marostegui) I would prefer if this is given a bit of a higher priority. This task has been here for a year and we are no longer providing updates to...
[18:30:52] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade db1108 to Bullseye - https://phabricator.wikimedia.org/T304492 (BTullis) p:Medium→High Understood. Thanks @Marostegui - Elevated the priority as requested. This is a three week sprint for our team so I'm...
[18:38:40] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade db1108 to Bullseye - https://phabricator.wikimedia.org/T304492 (Marostegui) Thanks - I'm happy to help with the process. Basically it's just to double check the host is in the right partman recipe so /srv doesn't...
[19:10:11] Analytics-Radar, Data-Engineering, Data-Engineering-Wikistats, Diffusion-Repository-Administrators, and 4 others: Archive analytics/wikistats - https://phabricator.wikimedia.org/T332004 (Nemo_bis) I actually have some patches from 2014 which are still relevant and I wish were merged. Like https:/...
[19:19:11] Data-Engineering-Planning, Product-Analytics, Data Pipelines (sprint 10): 2 additional new wikis - https://phabricator.wikimedia.org/T332070 (Milimetric)
[19:28:39] (PS4) Jennifer Ebe: T330206 - Create Mediacounts Load Hourly HQL [analytics/refinery] - https://gerrit.wikimedia.org/r/896334
[19:57:01] Data-Engineering-Planning, Event-Platform Value Stream, Patch-For-Review: Use new PageUndeleteComplete hook to emit mediawiki.page_change undelete event - https://phabricator.wikimedia.org/T328308 (OwenRB) @Ottomata I thought I'd take a look at this as a related follow on from the previous change. Ca...
[20:09:09] (PS5) Jennifer Ebe: T330206 - Create Mediacounts Load Hourly HQL [analytics/refinery] - https://gerrit.wikimedia.org/r/896334
[20:19:55] (PS1) Mazevedo: Add new unified mobile apps schema for Session [schemas/event/secondary] - https://gerrit.wikimedia.org/r/898851 (https://phabricator.wikimedia.org/T331481)