[00:05:19] (CR) Mforns: Add query to load MediaWiki snapshot to Cassandra AQS config table (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/989558 (https://phabricator.wikimedia.org/T352948) (owner: Mforns) [00:06:10] (PS10) Mforns: Add query to load MediaWiki snapshot to Cassandra AQS config table [analytics/refinery] - https://gerrit.wikimedia.org/r/989558 (https://phabricator.wikimedia.org/T352948) [00:06:49] (CR) Mforns: [V: +2 C: +2] Add query to load MediaWiki snapshot to Cassandra AQS config table [analytics/refinery] - https://gerrit.wikimedia.org/r/989558 (https://phabricator.wikimedia.org/T352948) (owner: Mforns) [01:35:47] (SystemdUnitFailed) firing: (13) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:53:04] Data-Engineering, Movement-Insights: Conda-Analytics packages incompatible with latest versions of Pandas and Numpy - https://phabricator.wikimedia.org/T356230 (nshahquinn-wmf) [01:53:33] Data-Engineering, Movement-Insights: Conda-Analytics packages incompatible with latest versions of Pandas and Numpy - https://phabricator.wikimedia.org/T356230 (nshahquinn-wmf) [02:14:51] Data-Engineering, Movement-Insights: Package versions in Conda-Analytics are not pinned - https://phabricator.wikimedia.org/T356231 (nshahquinn-wmf) [02:15:59] Data-Engineering, Movement-Insights: Conda-Analytics packages incompatible with latest versions of Pandas and Numpy - https://phabricator.wikimedia.org/T356230 (nshahquinn-wmf) [02:20:10] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (nshahquinn-wmf) Okay, I've released Wmfdata 2.3.0.
@BTullis while I was testing the new version, I noticed some dependency proble... [02:21:23] Data-Platform-SRE, Movement-Insights: Package versions in Conda-Analytics are not pinned - https://phabricator.wikimedia.org/T356231 (nshahquinn-wmf) [02:21:39] Data-Platform-SRE, Movement-Insights: Conda-Analytics packages incompatible with latest versions of Pandas and Numpy - https://phabricator.wikimedia.org/T356230 (nshahquinn-wmf) [05:35:48] (SystemdUnitFailed) firing: (13) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:50:45] Data-Platform-SRE (2024.01.22 - 2024.02.11), CirrusSearch, Discovery-Search, serviceops: Requesting permission to enable kafka log compaction for page_rerender on kafka-main - https://phabricator.wikimedia.org/T354794 (Gehel) Discussion with @Joe : no objection to enabling compaction as long as w... [09:35:48] (SystemdUnitFailed) firing: (13) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:58:29] Data-Platform-SRE (2024.01.22 - 2024.02.11), Data Pipelines, Patch-For-Review: Evaluate the PRESTO_EXPAND_DATA feature flag in superset - https://phabricator.wikimedia.org/T340144 (brouberol) I've also raised the suggestion to [[ https://github.com/apache/superset/discussions/26915 | expand nested co... [10:12:02] (PS1) Gehel: Simplifies CountryDatabaseReader. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/994674 [10:12:42] joal: ^ sorry for the drive by commit, but I needed some simple code to clear my head.
[10:29:34] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (ops-monitoring-bot) Host rebooted by stevemunene@cumin1002 with reason: None [10:38:06] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11): analytics/refinery scap deploy on test cluster fails with permission error - https://phabricator.wikimedia.org/T354703 (BTullis) I believe that I can see the problem here. There are some stray files owned by root underneath the directory to b... [10:38:11] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11): analytics/refinery scap deploy on test cluster fails with permission error - https://phabricator.wikimedia.org/T354703 (BTullis) Open→Resolved [10:51:10] (PS3) Btullis: Improve the display of nested columns from presto [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/994213 (https://phabricator.wikimedia.org/T340144) [10:52:17] brouberol: I'll deploy https://gerrit.wikimedia.org/r/c/analytics/superset/deploy/+/994213 to an-tool1005 if you're ok with that. It will overwrite any local modifications that you have made. [10:53:22] for sure [10:53:33] please go ahead [10:53:40] Ack, thanks.
[10:57:28] !log deploying https://gerrit.wikimedia.org/r/c/analytics/superset/deploy/+/994213 to superset-next to test nested display of presto columns [10:57:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:08:01] it seems to be working well 👍 [11:56:43] !log rebooting dbstore1008 for new kernel version (T356239) [11:56:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:00:12] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (Stevemunene) The new an-workers1157-1175 do not have any Virtual drive configured, however the datanode disks/partitions initialized are as expected. Comparing... [12:07:15] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (BTullis) Hmm, maybe the RAID controller on the new hosts has been set somehow to IT/JBOD mode, instead of RAID? We normally have to create a RAID0 logical volu...
[12:12:01] !log rebooting dbstore1009 for new kernel version (T356239) [12:12:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:25:48] (SystemdUnitFailed) firing: (15) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:43:06] (PS1) Aqu: Migrate session length to Iceberg [analytics/refinery] - https://gerrit.wikimedia.org/r/994697 (https://phabricator.wikimedia.org/T352672) [12:48:09] Data-Engineering (Sprint 8), Patch-For-Review: [Iceberg Migration] Migrate session length tables to Iceberg - https://phabricator.wikimedia.org/T352672 (CodeReviewBot) aqu opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/600 Migrate session length to Iceberg [12:49:39] (PS2) Aqu: Migrate session length to Iceberg [analytics/refinery] - https://gerrit.wikimedia.org/r/994697 (https://phabricator.wikimedia.org/T352672) [12:52:59] (CR) Brouberol: [C: +1] Improve the display of nested columns from presto [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/994213 (https://phabricator.wikimedia.org/T340144) (owner: Btullis) [12:53:48] Data-Engineering, Epic: [Iceberg Migration] Apache Iceberg Migration - https://phabricator.wikimedia.org/T333013 (Antoine_Quhen) [12:54:21] Data-Engineering (Sprint 8), Patch-For-Review: [Iceberg Migration] Migrate session length tables to Iceberg - https://phabricator.wikimedia.org/T352672 (Antoine_Quhen) Open→In progress [12:59:30] Data-Engineering (Sprint 8), Discovery-Search, Image-Suggestions: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2024-01-15 - https://phabricator.wikimedia.org/T356030 (Antoine_Quhen)
wmf.wikidata_item_page_link/snapshot=202... [13:12:18] (PS2) Gehel: Simplifies CountryDatabaseReader. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/994674 [13:19:23] (SystemdUnitFailed) firing: (15) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:20:55] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (Stevemunene) Saw some comments on some RAID config issues here https://phabricator.wikimedia.org/T349936#9360470 by @Papaul from the rack/setup task but not as... [13:24:29] Data-Engineering (Sprint 8): Add `event.app_donor_experience` fields to event sanitization allowlist - https://phabricator.wikimedia.org/T356214 (lbowmaker) [14:27:29] Data-Engineering, Data-Persistence-Backup, Infrastructure-Foundations, bacula, and 2 others: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655 (jcrespo) Let me know when you have a new backup i... [14:43:23] Data-Platform-SRE (2024.01.22 - 2024.02.11), CirrusSearch, Discovery-Search, serviceops: Requesting permission to enable kafka log compaction for page_rerender on kafka-main - https://phabricator.wikimedia.org/T354794 (brouberol) I'm going to go ahead, and go with solution 1. As there's so strong... [14:55:28] Data-Platform-SRE (2024.01.22 - 2024.02.11), CirrusSearch, Discovery-Search, serviceops: Requesting permission to enable kafka log compaction for page_rerender on kafka-main - https://phabricator.wikimedia.org/T354794 (brouberol) Looking at the [[ https://thanos.wikimedia.org/graph?g0.expr=sum(ka...
[14:58:50] Data-Platform-SRE (2024.01.22 - 2024.02.11), CirrusSearch, Discovery-Search, serviceops: Requesting permission to enable kafka log compaction for page_rerender on kafka-main - https://phabricator.wikimedia.org/T354794 (brouberol) ` brouberol@kafka-main1003:~$ kafka configs --entity-type topics --... [15:03:39] Data-Platform-SRE (2024.01.22 - 2024.02.11), CirrusSearch, Discovery-Search, serviceops: Requesting permission to enable kafka log compaction for page_rerender on kafka-main - https://phabricator.wikimedia.org/T354794 (brouberol) The topic is so small, the effect of compaction went completely unr... [15:10:10] Data-Engineering, Data-Persistence-Backup, Infrastructure-Foundations, bacula, and 2 others: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655 (Volans) @jcrespo we do have our first backup in `... [15:13:46] Data-Platform-SRE (2024.01.22 - 2024.02.11), Observability-Alerting: Fix "requests triggering circuit breakers" Elastic alert - https://phabricator.wikimedia.org/T355795 (fgiunchedi) [15:14:58] Data-Engineering, Data-Persistence-Backup, Infrastructure-Foundations, bacula, and 2 others: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655 (jcrespo) Let me do a manual run first and then we... [15:20:59] Hello, friends. I'm on the Ops Week rotation and there's a couple of things to deploy for analytics/refinery [15:22:08] AIUI none of them require a source build - they're all changes to supporting files, hql and csv [15:22:58] Should I just merge them and follow https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Deploy/Refinery#How_to_deploy?
[15:29:07] Data-Platform-SRE (2024.01.22 - 2024.02.11), CirrusSearch, Discovery-Search, serviceops: Requesting permission to enable kafka log compaction for page_rerender on kafka-main - https://phabricator.wikimedia.org/T354794 (brouberol) a:brouberol [15:29:19] Data-Platform-SRE (2024.01.22 - 2024.02.11), CirrusSearch, Discovery-Search, serviceops: Requesting permission to enable kafka log compaction for page_rerender on kafka-main - https://phabricator.wikimedia.org/T354794 (brouberol) ` brouberol@kafka-main1003:~$ kafka configs --entity-type topics --... [15:52:43] Data-Engineering (Sprint 8), Data Pipelines: [Refine refactoring] Refactor and migrate navigationtiming to Airflow - https://phabricator.wikimedia.org/T356192 (Antoine_Quhen) [15:54:47] phuedx: Yes, I believe so. Did you get that support you were asking for regarding the Iceberg table drop/recreate? [15:59:41] btullis: That's very much out of my wheelhouse. Happy to learn but need to pair on it [16:01:38] Data-Platform-SRE: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (mpopov) [16:01:59] Data-Platform-SRE: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (mpopov) Thank you @AndrewTavis_WMDE for alerting us of this. Pinging @Manuel for visibility. [16:02:28] Hi phuedx - Yes, merging the refinery patches is the thing to do, then deploy as per docs in wikitech, and finally operations (I'll surely help you with that) [16:12:09] Data-Platform-SRE (2024.01.22 - 2024.02.11), Data Pipelines, Patch-For-Review: Evaluate the PRESTO_EXPAND_DATA feature flag in superset - https://phabricator.wikimedia.org/T340144 (brouberol) [16:12:20] (PS1) Gehel: Simplify GeocodeDatabaseReader.
[analytics/refinery/source] - https://gerrit.wikimedia.org/r/994753 [16:12:54] Data-Platform-SRE, Discovery-Search: Clean up object storage in response to latest alert - https://phabricator.wikimedia.org/T356283 (bking) [16:13:41] Data-Platform-SRE (2024.01.22 - 2024.02.11), Discovery-Search: Clean up object storage in response to latest alert - https://phabricator.wikimedia.org/T356283 (Gehel) p:Triage→High [16:14:00] inflatador: ^ I'm moving this to our current milestone, unless you think otherwise [16:16:49] I'm going to merge all of those changes and sync them [16:17:18] brouberol: the IOPS / IOWAIT still seems reasonably low (in T354794). Do we have a measure of the impact on clients? Or do we have no one actively using eqiad at the moment? [16:17:19] T354794: Requesting permission to enable kafka log compaction for page_rerender on kafka-main - https://phabricator.wikimedia.org/T354794 [16:17:54] brouberol: also, this (T354794) is not blocked until we complete the DC switch? If so, could you add a comment so that we know when we can unblock? [16:18:43] I was actually doing this right now. The impact _could_ have been a slower producer (aka mirror maker) but the producer latency stayed flat throughout the operation, so no impact that I could see [16:19:09] do we know when the next failover will be? [16:19:10] joal: I... don't have +2 in analytics/refinery ^^ [16:19:22] Ah crap! [16:19:35] I'm gonna merge all that [16:19:48] https://gerrit.wikimedia.org/r/c/analytics/refinery/+/993478, https://gerrit.wikimedia.org/r/c/analytics/refinery/+/986839, and https://gerrit.wikimedia.org/r/c/analytics/refinery/+/994178 need to be merged [16:20:26] I'll note this down in the Ops Week log doc [16:20:48] Let's ask btullis if he may actually change this --^ [16:21:23] btullis: can we give +2 to phuedx on the analytics/refinery repo please?
[16:21:50] brouberol: https://wikitech.wikimedia.org/wiki/Switch_Datacenter/Switchover_Dates [16:21:54] joal: Yes, I will check whether I can grant it asap. [16:22:08] Data-Platform-SRE (2024.01.22 - 2024.02.11), CirrusSearch, Discovery-Search, serviceops: Requesting permission to enable kafka log compaction for page_rerender on kafka-main - https://phabricator.wikimedia.org/T354794 (brouberol) Open→Stalled This is blocked until the next codfw -> eqiad... [16:22:13] gehel: TIL, thanks! [16:22:15] Thanks btullis [16:22:16] Data-Platform-SRE (2024.01.22 - 2024.02.11), CirrusSearch, Discovery-Search (Current work): SUP: Production TODOs - https://phabricator.wikimedia.org/T354595 (brouberol) [16:22:23] brouberol: and more context on https://wikitech.wikimedia.org/wiki/Switch_Datacenter [16:23:15] phuedx: Please could you check again now? [16:23:46] btullis: I have +2 now. Thanks! [16:23:57] \o/ [16:24:01] thanks a million btullis [16:24:39] +1 [16:24:58] (CR) Phuedx: [C: +2] "🚂" [analytics/refinery] - https://gerrit.wikimedia.org/r/993478 (https://phabricator.wikimedia.org/T352669) (owner: TChin) [16:25:09] (CR) Phuedx: [C: +2] "🚂" [analytics/refinery] - https://gerrit.wikimedia.org/r/986839 (https://phabricator.wikimedia.org/T352671) (owner: TChin) [16:25:17] (CR) Phuedx: [C: +2] "🚂" [analytics/refinery] - https://gerrit.wikimedia.org/r/994178 (https://phabricator.wikimedia.org/T349743) (owner: Joal) [16:27:57] I need to V+2 'em as well I guess [16:28:28] Yes, I think so. Plus manual submit.
[16:32:21] (CR) Phuedx: [V: +2 C: +2] Use zstd compression for aqs_hourly [analytics/refinery] - https://gerrit.wikimedia.org/r/993478 (https://phabricator.wikimedia.org/T352669) (owner: TChin) [16:32:36] (CR) Phuedx: [V: +2 C: +2] Add iceberg version of interlanguage_navigation table [analytics/refinery] - https://gerrit.wikimedia.org/r/986839 (https://phabricator.wikimedia.org/T352671) (owner: TChin) [16:32:41] (CR) Phuedx: [V: +2 C: +2] Update sqoop list adding new wikis [analytics/refinery] - https://gerrit.wikimedia.org/r/994178 (https://phabricator.wikimedia.org/T349743) (owner: Joal) [16:32:47] Data-Platform-SRE: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (Gehel) p:Triage→High [16:36:35] Data-Engineering, Data-Persistence-Backup, Infrastructure-Foundations, bacula, and 2 others: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655 (Volans) All tests with @jcrespo for netbox were s... [16:37:50] Data-Engineering, Data-Persistence-Backup, Infrastructure-Foundations, bacula, and 2 others: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655 (Volans) Open→Resolved [16:39:04] Data-Engineering, Data-Persistence-Backup, Infrastructure-Foundations, bacula, and 2 others: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655 (Volans) Resolved→Open Sorry, moving in th... [16:48:29] phuedx: have you started deploying? [16:48:32] or not yet? [16:48:48] joal: Yes.
It was logged in -operations [16:48:55] ok :) [16:48:59] Data-Platform-SRE (2024.01.22 - 2024.02.11), SRE-Access-Requests: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (Gehel) [16:49:18] It's a habit we try to have to also manually log it in here - if you may [16:49:22] :) [16:52:47] !log Regular analytics weekly train [analytics/refinery@$(git rev-parse --short HEAD)] [16:52:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:52:57] phuedx: I unfortunately have a hot fix to push that'll require another refinery deploy :( [16:53:02] that's why I was asking [16:53:20] !log phuedx@deploy2002 Finished deploy [analytics/refinery@2c00cad]: Regular analytics weekly train [analytics/refinery@2c00cad1] (duration: 09m 52s) [16:53:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:53:32] joal: Happy to do it again so I can practice! [16:56:01] joal: I'll continue with the rest and redo it?
[16:56:54] phuedx: As you wish :) [16:57:01] !log phuedx@deploy2002 Started deploy [analytics/refinery@2c00cad] (thin): Regular analytics weekly train THIN [analytics/refinery@2c00cad1] [16:57:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:57:02] My patches are ready, like, now :) [16:57:09] !log phuedx@deploy2002 Finished deploy [analytics/refinery@2c00cad] (thin): Regular analytics weekly train THIN [analytics/refinery@2c00cad1] (duration: 00m 06s) [16:57:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:57:26] (PS1) Joal: Update pageview_actor to use fixed UDF [analytics/refinery] - https://gerrit.wikimedia.org/r/994788 [16:57:32] phuedx, aqu --^ [17:00:22] !log phuedx@deploy2002 Started deploy [analytics/refinery@2c00cad] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@2c00cad1] [17:00:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:01:07] And: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/601 [17:01:33] !log phuedx@deploy2002 Finished deploy [analytics/refinery@2c00cad] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@2c00cad1] (duration: 03m 35s) [17:02:04] joal: Fully deployed. That patch needs to be built and then synced, right? [17:02:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:03:23] phuedx: we should merge/submit the patch (refinery only), re-deploy refinery, then merge/deploy the airflow one [17:03:49] Oh, and not forget to deploy refinery onto HDFS [17:04:37] phuedx: I hope I'm clear :S If you need more detailed explanations, I can do that :) [17:07:22] btullis, joal: There are two items on the Etherpad that need an Iceberg table to be dropped, recreated, and backfilled. Is there a runbook for this? Is this something that either of you could help out with?
[17:08:10] phuedx: no cookbook :( It needs to be done manually - I can help for sure [17:08:25] joal: Thanks. Want to jump on a call? [17:08:43] phuedx: do you mind if we do that after the re-deploy for the hotfix? [17:09:13] Not at all. Do you need someone to review your patch or are you happy to continue? [17:10:40] phuedx: If it makes sense to you, I'm fine with you merging :) [17:10:54] Have you got a link to the Airflow job? [17:11:14] I pasted it just above (the merge request) [17:11:17] or did I [17:11:18] ? [17:11:23] You did [17:11:25] I just missed it ^^ [17:11:28] :) [17:13:42] joal: Am I looking at the correct definition for the IsRedirectToPageview UDF? https://gerrit.wikimedia.org/g/analytics/refinery/source/+/67f849e36d35127c58a09b66eec67f87e16ad8d3/refinery-hive/src/main/java/org/wikimedia/analytics/refinery/hive/IsRedirectToPageviewUDF.java [17:14:26] I ask because that's documented as taking the x_analytics header [17:15:02] I think you are - this uses some other code: https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/source/+/67f849e36d35127c58a09b66eec67f87e16ad8d3/refinery-hive/src/main/java/org/wikimedia/analytics/refinery/hive/IsPageviewUDF.java#148 [17:15:22] And this expects a map as x_analytics [17:16:38] I see I see [17:16:41] OK.
Proceeding [17:17:00] (CR) Phuedx: [V: +2 C: +2] "🚂" [analytics/refinery] - https://gerrit.wikimedia.org/r/994788 (owner: Joal) [17:17:22] Thanks a lot <3 [17:19:23] (SystemdUnitFailed) firing: (13) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:19:38] !log phuedx@deploy2002 Started deploy [analytics/refinery@bef134c]: Regular analytics weekly train [analytics/refinery@bef134c2] [17:19:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:21:54] Used the wrong message *facepalm* Might as well continue to do so ^^ [17:30:42] !log phuedx@deploy2002 Finished deploy [analytics/refinery@bef134c]: Regular analytics weekly train [analytics/refinery@bef134c2] (duration: 11m 05s) [17:30:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:31:15] !log phuedx@deploy2002 Started deploy [analytics/refinery@bef134c] (thin): Regular analytics weekly train THIN [analytics/refinery@bef134c2] [17:31:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:31:28] !log phuedx@deploy2002 Finished deploy [analytics/refinery@bef134c] (thin): Regular analytics weekly train THIN [analytics/refinery@bef134c2] (duration: 00m 08s) [17:31:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:31:37] Data-Engineering: [Maintenance] Add a deletion job for `hdfs_usage` data - https://phabricator.wikimedia.org/T348774 (Ahoelzl) [17:31:53] !log phuedx@deploy2002 Started deploy [analytics/refinery@bef134c] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@bef134c2] [17:31:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:35:29] !log phuedx@deploy2002 Finished deploy
[analytics/refinery@bef134c] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@bef134c2] (duration: 03m 29s) [17:35:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:35:33] ^ joal [17:39:33] awesome [17:39:45] phuedx: have you deployed to HDFS? [17:40:12] !pause pageview_actor_hourly for deploy [17:40:16] !log pause pageview_actor_hourly for deploy [17:40:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:40:23] joal: About to do that now [17:40:28] perfect [17:41:59] Data-Platform-SRE (2024.01.22 - 2024.02.11), Wikidata, Discovery-Search (Current work), Patch-For-Review: Enable cross federation between experimental WDQS endpoints - https://phabricator.wikimedia.org/T355888 (dcausse) [17:46:35] !log Deployed refinery using scap, then deployed onto hdfs [17:46:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:47:01] joal: ^ [17:47:19] \o/ [17:48:05] I'm almost done with my meeting - I'll help with the ops very soon [17:49:06] If you may phuedx, would you review and then deploy the airflow patch? [17:51:10] joal: I can try. I noticed that the pipeline has failed [17:58:42] Data-Platform-SRE (2024.01.22 - 2024.02.11), SRE-Access-Requests: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (BTullis) a:BTullis [18:00:04] of course, my patch has CI failing :) Will fix that [18:07:25] Data-Platform-SRE (2024.01.22 - 2024.02.11), SRE-Access-Requests, Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (BTullis) It isn't 100% clear from the description whether the user should still have production shell acc...
[18:09:48] Data-Platform-SRE (2024.01.22 - 2024.02.11), SRE-Access-Requests, Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (BTullis) Adding @MoritzMuehlenhoff for visibility - Do we need to do anything else regarding an off-board... [18:13:03] Data-Engineering (Sprint 8), Discovery-Search, Image-Suggestions: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2024-01-15 - https://phabricator.wikimedia.org/T356030 (mfossati) [18:20:43] joal: Should it be deployed on the analytics_test and analytics Airflow instances? [18:20:48] *on both the [18:21:03] phuedx: no need to deploy the test instance [18:21:06] only the analytics one [18:21:32] Cool. Just waiting for the build to succeed :) [18:21:47] Thanks so much :) [18:22:19] Data-Engineering (Sprint 8), Discovery-Search, Image-Suggestions: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2024-01-15 - https://phabricator.wikimedia.org/T356030 (mfossati) @dcausse , `analytics_platform_eng.image_sugg... [18:33:40] joal: Merged. Deploying now [18:36:47] joal: Paused. There have been a number of MRs merged since the last deployment (only yesterday) [18:36:58] wow [18:37:00] Is it OK to deploy them all? [18:37:23] Data-Platform-SRE (2024.01.22 - 2024.02.11), SRE-Access-Requests, Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (Manuel) I looked into this: Goran, upon quitting his contract with WMDE, requested continued private data...
[18:37:43] let me check :) [18:39:31] I think it's fine phuedx, you can go [18:39:38] I'll take responsibility if anything breaks :) [18:40:15] !log phuedx@deploy2002 Started deploy [airflow-dags/analytics@5078a6b]: (no justification provided) [18:40:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:40:48] !log phuedx@deploy2002 Finished deploy [airflow-dags/analytics@5078a6b]: (no justification provided) (duration: 00m 28s) [18:40:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:41:28] mforns: we're deploying your new DAG (about loading the AQS config to cassandra :) [18:41:34] mforns: We won't start it [18:44:33] awesome! thank you! [18:51:20] hm, I had paused the pageview_actor_hourly DAG, it's not paused anymore... weird [18:53:35] Data-Platform-SRE (2024.01.22 - 2024.02.11), SRE-Access-Requests, Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (BTullis) Ah, thanks @Manuel. It seems that I have acted with too much haste. I can revert the change then... [18:58:19] Data-Platform-SRE (2024.01.22 - 2024.02.11), SRE-Access-Requests, Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (mpopov) > We should add `expiry_date` and `expiry_contact` fields to reflect the NDA @KFrancis are you s... [19:02:56] joal: Can you help with the Iceberg tables? [19:03:05] phuedx: Yes! [19:03:20] phuedx: I can jump on a call if you wish now, or even just take care of the thing if you prefer [19:04:42] Data-Platform-SRE (2024.01.22 - 2024.02.11), SRE-Access-Requests, Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (BTullis) I have just deployed the revert, so the changes should be undone and the user should still have...
[19:05:15] I have to get my youngest to bed in 5 minutes. I can jump on a call after? Or if you can do it in the meantime, that'd be great [19:05:32] I'm going for it :) [19:09:06] Data-Platform-SRE (2024.01.22 - 2024.02.11), SRE-Access-Requests, Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (Manuel) > I'm not quite sure that I understand what this means, I just wanted to emphasize that WMDE ha... [19:11:37] Data-Platform-SRE (2024.01.22 - 2024.02.11), SRE-Access-Requests, Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (Manuel) > Apologies for jumping the gun and revoking access before having thoroughly checked. All good... [19:11:59] Data-Platform-SRE (2024.01.22 - 2024.02.11), SRE-Access-Requests, Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (MoritzMuehlenhoff) >>! In T356279#9503507, @BTullis wrote: > We should add `expiry_date` and `expiry_cont... [19:14:28] !log Drop/Recreate wmf_traffic.aqs_hourly table (iceberg) to change compression format [19:14:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:14:49] !log Backfill wmf_traffic.aqs_hourly [19:14:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:17:26] Data-Platform-SRE (2024.01.22 - 2024.02.11), SRE-Access-Requests, Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (MoritzMuehlenhoff) >>! In T356279#9503576, @Manuel wrote: >> from a technical perpective. > > I don't kn...
[19:22:01] Data-Platform-SRE (2024.01.22 - 2024.02.11), SRE-Access-Requests, Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (Dzahn) Shouldn't we add the affected user to this ticket and ask them about all this?
[19:41:31] Data-Platform-SRE (2024.01.22 - 2024.02.11), SRE-Access-Requests, Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (Manuel) > it grants running commands under the analytics-wmde user on the stat* hosts Reading T310055#80...
[19:43:05] (PS1) Joal: Fix pageview_actor after incomplete previous fix [analytics/refinery] - https://gerrit.wikimedia.org/r/994816
[19:43:20] phuedx: for when you're back --^
[19:43:36] if you're not back in say, 5 minutes, I'll go ahead and merge myself :)
[19:47:29] And obviously, I'll take care of deploying
[19:57:46] (CR) Joal: [V: +2 C: +2] "Self merging for hotfix" [analytics/refinery] - https://gerrit.wikimedia.org/r/994816 (owner: Joal)
[19:59:28] !log Deploying refinery with scap for second hotfix
[19:59:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:04:18] joal: Thanks for sorting that out
[20:05:19] np - I messed up the fix #face_palm
[20:05:40] Data-Platform-SRE (2024.01.22 - 2024.02.11), SRE-Access-Requests, Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (mpopov) I'm also sorry for making the request without knowing about the prior request. >>! In T356279#95...
[20:12:44] Data-Platform-SRE (2024.01.22 - 2024.02.11), SRE-Access-Requests, Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (AndrewTavis_WMDE) I apologize for my part of bringing this up with the assumption that such access isn't...
[20:17:55] Data-Engineering, EventStreams, MediaWiki-General, Privacy Engineering, and 3 others: Create Mediawiki "oversightprotect" action that suppresses usernames of all edits of a page - https://phabricator.wikimedia.org/T354577 (Htriedman) Hi @DannyS712! Have you made any progress on this?
[20:47:41] Data-Platform-SRE (2024.01.22 - 2024.02.11), SRE-Access-Requests, Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (KFrancis) >>! In T356279#9503531, @mpopov wrote: >> We should add `expiry_date` and `expiry_contact` fiel...
[21:20:49] (SystemdUnitFailed) firing: (13) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:51:00] Data-Engineering, EventStreams, MediaWiki-General, Privacy Engineering, and 3 others: Create Mediawiki "oversightprotect" action that suppresses usernames of all edits of a page - https://phabricator.wikimedia.org/T354577 (DannyS712) >>! In T354577#9503830, @Htriedman wrote: > Hi @DannyS712! Have...
[22:03:21] Data-Engineering, EventStreams, MediaWiki-General, Privacy Engineering, and 3 others: Create Mediawiki "oversightprotect" action that suppresses usernames of all edits of a page - https://phabricator.wikimedia.org/T354577 (Htriedman) There's no huge rush — we've deployed a [[ https://phabricator....
[22:19:30] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: Allow federated queries with the MiMoTextBase SPARQL endpoint - https://phabricator.wikimedia.org/T351488 (RKemper) @HinMar Sorry for missing this request - our bad! I see your earlier comment mentioned the project expiring by end of 2023. Is the p...
[22:22:19] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Patch-For-Review: qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (RKemper) >>! In T339347#9431527, @Nikki wrote: > Could https://qlever.cs.uni-freiburg.de/api/wikimedia-commons also be a...
[22:22:23] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Patch-For-Review: Allow federated queries with the NFDI4Culture Knowledge Graph - https://phabricator.wikimedia.org/T346455 (RKemper) @Loz.ross Sorry for the delay, we've added the endpoint. Can you confirm it's working with an example query?
[22:47:14] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Change TLS/load balancer configuration for cloudelastic - https://phabricator.wikimedia.org/T355720 (bking) We're about to roll back the last patch. Here's the error we're getting from puppet: ` Error: /Stage[main]/Profile::Elasticsearch::Ci...
[22:55:15] (HdfsRpcQueueLength) firing: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength
[23:00:15] (HdfsRpcQueueLength) resolved: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength