[04:12:06] Analytics, Data-Engineering-Icebox, CX-analytics, Language-analytics, Technical-Debt: Special:ContentTranslationStats is slow and getting crowded - https://phabricator.wikimedia.org/T325790#9582373 (santhosh) There is a feature in superset where we can just embed any dashboards in any web pag...
[06:35:53] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[08:16:10] Data-Engineering, Data-Persistence, 2024.02.12 - 2024.03.03: Migrate dbstore* hosts to 10.6 - https://phabricator.wikimedia.org/T356961#9582590 (Marostegui) @BTullis can you upgrade dbstore1007 soon please? That might block our upgrades in s2, s3 and s4 in the near future (not blocked as of today) I...
[08:36:53] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[09:18:36] (PS6) Joal: Add skip-list option to import_mediawiki_dumps [analytics/refinery] - https://gerrit.wikimedia.org/r/1006957 (https://phabricator.wikimedia.org/T357859)
[09:18:58] (CR) Joal: Add skip-list option to import_mediawiki_dumps (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/1006957 (https://phabricator.wikimedia.org/T357859) (owner: Joal)
[09:19:40] (PS7) Joal: Add skip-list option to import_mediawiki_dumps [analytics/refinery] - https://gerrit.wikimedia.org/r/1006957 (https://phabricator.wikimedia.org/T357859)
[09:24:50] (CR) Joal: [V: +2 C: +2] "Manually tested with various parameters. Merging for deploy." [analytics/refinery] - https://gerrit.wikimedia.org/r/1006957 (https://phabricator.wikimedia.org/T357859) (owner: Joal)
[09:25:16] (CR) Joal: [V: +2 C: +2] Add skip-list option to import_mediawiki_dumps (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/1006957 (https://phabricator.wikimedia.org/T357859) (owner: Joal)
[09:28:47] !log Deploying Refinery for T357859
[09:28:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:28:50] T357859: Skip Wikidata when loading XML dumps to the Data Lake - https://phabricator.wikimedia.org/T357859
[09:48:38] !log Deploying refinery onto HDFS
[09:48:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:46:15] btullis, brouberol: we can talk about https://gerrit.wikimedia.org/r/c/operations/puppet/+/1007301 here if you wish
[10:48:04] joal: Running a pcc of that patch against an-launcher1002 just to be on the safe side, but it looks good.
[10:48:31] ack btullis - thanks for this :)
[10:50:32] joal: Some unexpected whitespace changes in https://puppet-compiler.wmflabs.org/output/1007301/1503/an-launcher1002.eqiad.wmnet/index.html but nothing big.
[10:51:58] hum, probably due to my if in the template - I don't know how to make it better though :(
[10:53:47] I think it is probably just a `<%-` in here, but not 100% sure. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1007301/2/modules/profile/templates/analytics/refinery/job/refinery-import-mediawiki-dumps.sh.erb#12
[10:54:24] I can try uploading a new patchset if you like?
[11:04:16] btullis: please feel free!
[11:04:51] joal: Will do.
[11:08:15] thanks again btullis
[11:08:29] !log reimaging dbstore1007 to bookworm for T356961
[11:08:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:08:32] T356961: Migrate dbstore* hosts to 10.6 - https://phabricator.wikimedia.org/T356961
[11:08:50] joal: A pleasure. Oh, I have a question for you.
[11:08:57] pleaseb
[11:08:59] please btullis
[11:10:11] It relates to T345771 and the `refinery-sqoop-whole-mediawiki`
[11:10:12] T345771: Adapt Sqoop to pagelinks schema change - https://phabricator.wikimedia.org/T345771
[11:10:34] sure
[11:10:35] We still have a failed systemd job on an-launcher1002 since the beginning of the month.
[11:10:58] https://usercontent.irccloud-cdn.com/file/lJrYYwGz/image.png
[11:11:05] Data-Engineering, Data-Persistence, 2024.02.12 - 2024.03.03: Migrate dbstore* hosts to 10.6 - https://phabricator.wikimedia.org/T356961#9582993 (Marostegui) Don't forget to run mysql_upgrade for each section (mysql_upgrade -S $PATH_TO_SOCKET_LOCATIOn)
[11:11:14] Ah right - this will not fix itself - we should acknowledge it
[11:11:51] Since this timer will not run again until the beginning of next month. Are we OK simply to do a `systemctl reset-failed` on it, or is there anything else that we should do?
[11:12:21] I realise I've left it a bit late in the month to ask :-)
[11:13:22] Data-Engineering, Data-Persistence, 2024.02.12 - 2024.03.03: Migrate dbstore* hosts to 10.6 - https://phabricator.wikimedia.org/T356961#9582994 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host dbstore1007.eqiad.wmnet with OS bookworm
[11:17:06] btullis: I think a reset-failed is enough
[11:17:18] joal: Great, thanks.
[11:17:35] btullis: The timer will fail anew next month, we know it, but there is nothing we've done yet to prevent this. I need to take care of that soon.
[11:18:42] joal: OK, thanks for the advance notice. Just so you know, there will be...
[11:19:01] 1) an email from the systemd job itself that goes to data-eningeering-alerts@lists.wikimedia.org containing the stderr of the job
[11:20:03] 2) regular emails about the failed systemd job that go to data-platform-alerts@wikimedia.org until someone does a `systemctl reset-failed` on the failed unit.
[11:23:44] ack btullis - Do you think I should subscribe to data-platform-alerts?
[11:27:59] joal: I think it's probably a good idea, but as long as it doesn't make you feel swamped. The idea of this list is that it should be the default destination for alerts that are intended for SREs first. A way of reducing the cognitive load on whoever is on Data Engineering Ops week.
[11:28:25] They also go to the #wikimedia-data-platform-alerts IRC channel, so that's another option for you?
[11:28:27] Yeah the split makes a lot of sense :) Would you please include me in the new group?
[11:28:35] Will do.
[11:28:39] I deal better with emails than IRC :)
[11:28:49] thanks btullis :)
[11:29:05] Cool. I've deployed your patch too.
[11:29:23] Yay <3
[11:51:32] Data-Engineering, Data Products, Observability-Logging, Traffic, Patch-For-Review: Install new Benthos instance on cp hosts - https://phabricator.wikimedia.org/T358109#9583090 (Fabfur)
[11:52:21] Data-Engineering, Data-Persistence, 2024.02.12 - 2024.03.03: Migrate dbstore* hosts to 10.6 - https://phabricator.wikimedia.org/T356961#9583091 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host dbstore1007.eqiad.wmnet with OS bookworm completed: - dbstor...
[12:01:55] Data-Engineering, Data-Persistence, 2024.02.12 - 2024.03.03: Migrate dbstore* hosts to 10.6 - https://phabricator.wikimedia.org/T356961#9583115 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host dbstore1007.eqiad.wmnet with OS bullseye
[12:09:10] Data-Engineering, Data-Platform-SRE, Epic: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710#9583140 (BTullis)
[12:10:00] (CR) Gmodena: Extract RefineSingleApp code from Refine (14 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1003745 (https://phabricator.wikimedia.org/T356363) (owner: Joal)
[12:20:46] Data-Engineering, 2024.02.12 - 2024.03.03, Patch-For-Review: Monitor the availability of the superset deployments - https://phabricator.wikimedia.org/T356484#9583226 (BTullis) In addition to the `kube_deployment_status_replicas_available` metric, it might be quite a good idea to use one or two Prome...
[12:25:46] Data-Engineering, 2024.02.12 - 2024.03.03, Patch-For-Review: Monitor the availability of the superset deployments - https://phabricator.wikimedia.org/T356484#9583266 (BTullis) Hmm. It seems that we already have some http blackbox probes defined here in the service catalog: https://github.com/wikimed...
[12:25:48] joal: sorry, I'm only back now
[12:25:59] I see btullis was more reactive, thanks!
[12:36:53] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[12:37:10] np brouberol :)
[12:37:22] Data-Engineering, Data-Persistence, 2024.02.12 - 2024.03.03: Migrate dbstore* hosts to 10.6 - https://phabricator.wikimedia.org/T356961#9583317 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host dbstore1007.eqiad.wmnet with OS bullseye completed: - dbstor...
[12:40:02] Data-Engineering, Data Pipelines, SRE, Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227#9583341 (mpopov) Okay, so it's been a few years now and this bug still exists and impacts the quality of our analyses substantially (especially for Future...
[12:50:55] Data-Engineering, WMF-JobQueue, serviceops, Patch-For-Review, and 3 others: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745#9583374 (gmodena) Hey @Clement_Goubert , I was on PTO last week and trying to piece together wh...
[13:06:53] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[13:48:48] btullis: I confirm I receive data-platform emails - thank you :)
[14:28:43] Data-Engineering, Data-Persistence, 2024.02.12 - 2024.03.03: Migrate dbstore* hosts to 10.6 - https://phabricator.wikimedia.org/T356961#9583755 (BTullis)
[14:31:22] Data-Engineering, Data-Persistence, 2024.02.12 - 2024.03.03: Migrate dbstore* hosts to 10.6 - https://phabricator.wikimedia.org/T356961#9583768 (Marostegui) >>! In T356961#9582993, @Marostegui wrote: > Don't forget to run mysql_upgrade for each section (mysql_upgrade -S $PATH_TO_SOCKET_LOCATIOn) I h...
[14:31:53] Data-Engineering, Data-Persistence, 2024.02.12 - 2024.03.03: Migrate dbstore* hosts to 10.6 - https://phabricator.wikimedia.org/T356961#9583772 (BTullis) All done now. Upgraded to bookworm. I had to reimage twice, because I didn't shut down each mariadb section cleanly before the first attempt at a r...
[14:32:48] Data-Engineering, Data-Persistence, 2024.02.12 - 2024.03.03: Migrate dbstore* hosts to 10.6 - https://phabricator.wikimedia.org/T356961#9583781 (BTullis) >>! In T356961#9583768, @Marostegui wrote: >>>! In T356961#9582993, @Marostegui wrote: >> Don't forget to run mysql_upgrade for each section (mysql...
[14:33:34] Data-Engineering, Data-Persistence, 2024.02.12 - 2024.03.03: Migrate dbstore* hosts to 10.6 - https://phabricator.wikimedia.org/T356961#9583784 (Marostegui) Open→Resolved Thanks for getting this done
[15:01:57] Data-Engineering, Data-Platform-SRE: Update the From: addresses of all email from DPE pipelines so that they use routable addresses - https://phabricator.wikimedia.org/T358675 (BTullis)
[15:04:09] Analytics, Data-Engineering-Icebox, CX-analytics, Language-analytics, and 2 others: Special:ContentTranslationStats is slow and getting crowded - https://phabricator.wikimedia.org/T325790#9583907 (Pginer-WMF)
[15:26:57] Data-Engineering, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 11 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118#9584108 (Jdforrester-WMF)
[15:33:02] Data-Engineering, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 11 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118#9584162 (Jdforrester-WMF)
[15:56:37] Data-Engineering, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 11 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118#9584263 (Jdforrester-WMF)
[16:00:21] Data-Engineering, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 11 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118#9584257 (Jdforrester-WMF)
[16:16:10] Data-Engineering, 2024.03.04 - 2024.03.24: Update the From: addresses of all email from DPE pipelines so that they use routable addresses - https://phabricator.wikimedia.org/T358675#9584319 (BTullis)
[17:28:01] btullis: I'm gonna need your help again for some puppet please
[17:28:12] btullis: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1007407
[17:28:17] for when you have a minute
[17:28:33] It'd be awesome if we could have it merged/deployed before Friday (1st of month)
[17:46:37] joal: looking any minute now.
[17:46:43] thanks btullis
[18:06:50] (PS1) Joal: [WIP] Fix sqoop for pagelinks normalization migration [analytics/refinery] - https://gerrit.wikimedia.org/r/1007413 (https://phabricator.wikimedia.org/T345771)
[18:08:33] joal: That's merged and deployed. Running puppet on an-launcher1002 now to pull the changes.
[18:08:51] Many thanks for this btullis <3
[18:09:30] A pleasure.
[18:09:55] (PS2) Joal: [WIP] Fix sqoop for pagelinks normalization migration [analytics/refinery] - https://gerrit.wikimedia.org/r/1007413 (https://phabricator.wikimedia.org/T345771)
[18:26:24] Data-Engineering, Foundational Technology Requests, 2024.03.04 - 2024.03.24: Enable the Marketing Campaigns Reporting plugin for matomo - https://phabricator.wikimedia.org/T319013#9584969 (BTullis) We understand that the security review is under way and there may be a new campaign for which this Mato...
[18:29:40] Data-Engineering, 2024.03.04 - 2024.03.24: Deprecate Hue and stop the services - https://phabricator.wikimedia.org/T341895#9584988 (BTullis)
[18:33:43] Data-Engineering, Data-Platform-SRE, SRE, serviceops, Event-Platform: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058#9585000 (BTullis) I believe that this ticket will be invalidated by the approach that that has tested and agreed upon in {T331894}. There...
[18:42:06] Data-Engineering, Data-Platform-SRE, Epic: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710#9585047 (BTullis)
[19:11:15] (PS3) Joal: Fix sqoop for pagelinks normalization migration [analytics/refinery] - https://gerrit.wikimedia.org/r/1007413 (https://phabricator.wikimedia.org/T345771)
[19:23:09] Data-Engineering, Data-Engineering-Kanban: Pageview Data loss due to wrong version of package installed on some varnishkafka instances - https://phabricator.wikimedia.org/T300164#9585229 (BCornwall)
[19:23:28] Data-Engineering-Radar, Traffic: Lock-in Varnish and VarnishKafka versions - https://phabricator.wikimedia.org/T304617#9585226 (BCornwall) Open→In progress a:CDobbins
[19:31:05] (PS4) Joal: Fix sqoop for pagelinks normalization migration [analytics/refinery] - https://gerrit.wikimedia.org/r/1007413 (https://phabricator.wikimedia.org/T345771)
[19:32:28] (PS5) Joal: Fix sqoop for pagelinks normalization migration [analytics/refinery] - https://gerrit.wikimedia.org/r/1007413 (https://phabricator.wikimedia.org/T345771)
[19:38:53] (PS6) Joal: Fix sqoop for pagelinks normalization migration [analytics/refinery] - https://gerrit.wikimedia.org/r/1007413 (https://phabricator.wikimedia.org/T345771)
[19:45:55] (PS7) Joal: Fix sqoop for pagelinks normalization migration [analytics/refinery] - https://gerrit.wikimedia.org/r/1007413 (https://phabricator.wikimedia.org/T345771)
[19:52:40] (PS8) Joal: Fix sqoop for pagelinks normalization migration [analytics/refinery] - https://gerrit.wikimedia.org/r/1007413 (https://phabricator.wikimedia.org/T345771)
[20:22:05] (PS9) Joal: Fix sqoop for pagelinks normalization migration [analytics/refinery] - https://gerrit.wikimedia.org/r/1007413 (https://phabricator.wikimedia.org/T345771)
[20:41:53] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[21:49:14] (PS10) Joal: Fix sqoop for pagelinks normalization migration [analytics/refinery] - https://gerrit.wikimedia.org/r/1007413 (https://phabricator.wikimedia.org/T345771)
[21:51:14] \o/ this time it's working :)
[21:51:44] mforns: If you have a minute, would you mind reviewing the patch above please? I'd like to merge and deploy it tomorrow
[22:01:21] (CR) Joal: [V: +2] "Tested on cluster" [analytics/refinery] - https://gerrit.wikimedia.org/r/1007413 (https://phabricator.wikimedia.org/T345771) (owner: Joal)
[22:16:36] Data-Engineering, Data Products: Modify ClickStreamBuilder pipeline to cope with pagelinks schema changes - https://phabricator.wikimedia.org/T355588#9585684 (xcollazo) @lbowmaker `clickstream_monthly_dag.py` sensors typically take till the 3rd of the month to succeed, so we have about 4 days till this b...
[22:23:33] (PS2) Aleksandar Mastilovic: Add HQL query files for the "pingback" report [analytics/refinery] - https://gerrit.wikimedia.org/r/1006970
[22:40:03] Data-Engineering, Data-Platform-SRE, Scap: analytics/refinery: Stop using git-fat - https://phabricator.wikimedia.org/T328472#9585732 (dancy) Hi folks. scaps git-lfs support has been fixed so we can migrate analytics/refinery to git-lfs. To enable LFS for this repo in Gerrit, I need to know what yo...
[22:56:53] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage