[01:12:41] Data-Engineering, Movement-Insights: [Data Quality] Implement basic data quality metrics for Unique Devices datasets - https://phabricator.wikimedia.org/T357833#9570810 (Mayakp.wiki) sounds good ! thanks !!
[07:46:36] Data-Engineering, Data-Platform-SRE (2024.02.12 - 2024.03.03): Create saved views for the superset deployment logs - https://phabricator.wikimedia.org/T356485#9571084 (Stevemunene)
[09:04:06] Data-Platform-SRE, SRE, SRE-Access-Requests, Patch-For-Review, User-ItamarWMDE: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279#9571245 (MoritzMuehlenhoff) Open→Resolved @AndrewTavis_WMDE @Manuel I've removed the access al...
[09:15:35] Data-Platform-SRE, Discovery-Search, Observability-Metrics, MW-1.42-notes (1.42.0-wmf.20; 2024-02-27), and 2 others: Audit and prioritize metrics for conversion to statslib that are used for graphite-based alerting - https://phabricator.wikimedia.org/T350597#9571257 (Gehel)
[09:19:13] Data-Platform-SRE: Hardware requests for Data Platform Engineering - FY2024-2025 - https://phabricator.wikimedia.org/T358314#9571270 (Gehel)
[09:19:31] Data-Platform-SRE: Hardware requests for Data Platform Engineering - FY2024-2025 - https://phabricator.wikimedia.org/T358314#9571281 (Gehel) p:Triage→High
[09:19:33] Data-Platform-SRE: Check home/HDFS leftovers of goransm - https://phabricator.wikimedia.org/T358311#9571282 (Gehel) p:Triage→Low
[09:19:49] Data-Platform-SRE, Data-Platform: Review the Airflow instance security settings to ensure that they are still suitable - https://phabricator.wikimedia.org/T358137#9571283 (Gehel) p:Triage→Medium
[09:20:44] Data-Platform-SRE, Data-Platform: [Presto] Use JWT authentication instead of Kerberos for cluster-internal communication - https://phabricator.wikimedia.org/T358196#9571285 (Gehel) p:Triage→Medium
[09:52:24] Data-Platform-SRE (2024.02.12 - 2024.03.03), Discovery-Search (Current work), Documentation, Patch-For-Review: Review wikitech:Search and write processes for k8s world - https://phabricator.wikimedia.org/T356303#9571334 (JMeybohm) Yesterday on IRC the question was raised: > this is probably the...
[10:03:18] (CR) Brouberol: [C: +1] Add stat1010 ans stat1011 to hdfs_tools target [analytics/hdfs-tools/deploy] - https://gerrit.wikimedia.org/r/1005538 (https://phabricator.wikimedia.org/T354526) (owner: Stevemunene)
[10:23:53] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[10:32:28] joal: if you have a bit of time to review https://gitlab.wikimedia.org/repos/ci-tools/wmf-maven-tool-configs/-/merge_requests/6 and https://gitlab.wikimedia.org/repos/ci-tools/wmf-jvm-parent-pom/-/merge_requests/8 (more explanation of the context if needed on https://youtu.be/chKkKg9aWns)
[10:32:59] joal: Oh, sorry, I just realized you already reviewed!
[10:37:03] https://gitlab.wikimedia.org/repos/ci-tools/wmf-jvm-parent-pom/-/merge_requests/8 is ready for review (wmf-maven-tool-configs has been released). I might merge without waiting for your approval, this change seems so trivial.
[10:47:48] Data-Platform-SRE, SRE, SRE-Access-Requests, Patch-For-Review, User-ItamarWMDE: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279#9571538 (AndrewTavis_WMDE) Thank you, @MoritzMuehlenhoff! Really grateful to have this finalized. I'll...
[11:09:44] Data-Platform-SRE (2024.02.12 - 2024.03.03), Patch-For-Review: Serve Superset static assets from an optimised container - https://phabricator.wikimedia.org/T357890#9571626 (CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/superset/-/merge_requests/30 Include the superse...
[11:13:11] Data-Platform-SRE: Check home/HDFS leftovers of goransm - https://phabricator.wikimedia.org/T358311#9571637 (AndrewTavis_WMDE) Thank you for making this, @MoritzMuehlenhoff! Linking the related cleanup task for Gerrit, {T357697}. Specifically in relation to this task, in T356618 we found `hdfs:///tmp/wmde/an...
[11:29:50] Data-Platform-SRE (2024.02.12 - 2024.03.03), Patch-For-Review: Serve Superset static assets from an optimised container - https://phabricator.wikimedia.org/T357890#9571671 (CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/superset/-/merge_requests/30 Include the superse...
[12:38:41] Team, I’ve had to step away for a couple of hours. I’ll log back during the afternoon/evening to catch-up
[13:00:30] Can someone familiar with node.js/eventgate-main take a look at https://gerrit.wikimedia.org/r/1005974 ?
[13:01:42] We think eventgate-main's sing worker exceeding a heap memory limit of 200MB that wasn't raised with the container's memory limit may explain the spikes in job enqueuing errors we see
[13:01:48] s/sing/single/
[13:13:53] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[13:49:44] (CR) Joal: "Being as nitpicky as possible - You miss the backfilling script in this patch as well." [analytics/refinery] - https://gerrit.wikimedia.org/r/1003928 (owner: Aleksandar Mastilovic)
[14:32:55] Data-Platform-SRE (2024.02.12 - 2024.03.03): RdfStreamingUpdaterSpaceUsageTooHigh - https://phabricator.wikimedia.org/T356313#9572075 (bking) Open→Invalid Closing as redundant; see T348685 for more details.
[14:46:09] Data-Engineering, Edit-Review-Improvements-Integrated-Filters, Growth-Team, Machine-Learning-Team, and 2 others: Integration of Revert Risk Scores to Recent Changes as a filter - https://phabricator.wikimedia.org/T329071#9572117 (Samwalton9-WMF)
[15:21:41] claime: Looking now. Are you looking to deploy this today, or wait until next week?
[15:22:45] btullis: I'm a bit worried about deploying it today because there's a risk that raising that limit + the overhead could cause the containers to OOM, but not immediately, and I don't want to cause an obscure issue during the week end
[15:23:43] Yeah, that was my thinking too. Change looks good though. I'd be happy to keep an eye on it on Monday, if that helps.
[15:25:08] sure, we can do that :)
[15:25:26] I just hope that all the jobqueuerrors don't end up in too many lost jobs
[15:26:31] Do they go back on another queue, or do they get dropped on the floor? Sorry, I don't know the jobqueue that well.
[15:27:48] I'm not sure either
[15:32:22] just confirmed, they're unfortunately lost for this error
[15:32:39] most jobs will be retried a few times, but during the worst of it we were seeing ~4k failures per minute so there'll definitely be lost jobs in there
[15:34:06] yeah, we're actually going to raise this to UBN actually
[15:34:20] Wouldn't container oom events like this show up as kubernetes events? I'm seeing nothing here.
[15:34:25] https://www.irccloud.com/pastebin/25ufGKwk/
[15:34:27] btullis: they're not technically oom
[15:34:37] but check the container logs
[15:34:43] you'll see the heap size limit warnings
[15:36:02] I'll merge the patch, but if it doesn't fix it we have a problem either with eventgate-main itself or with kafka-main
[15:36:03] Yep, I see. OK, shall we deploy now?
[15:36:12] yep, merging
[15:36:19] Ack
[15:45:35] I'm seeing some readiness probes failing, but maybe that's normal.
[15:47:57] Seems to be OK now.
[15:49:10] So all of the pods in codfw have one restart, but seem to be OK now. Perhaps we need to increase the readiness timeout in future, to give them more time to start.
[16:12:22] Data-Platform-SRE (2024.02.12 - 2024.03.03), Patch-For-Review: Serve Superset static assets from an optimised container - https://phabricator.wikimedia.org/T357890#9572434 (BTullis) I think that this is ready to go now. I got fairly confused because the requests to https://superset-next-k8s.wikimedia.or...
[16:15:09] Data-Platform-SRE, observability, Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438#9572445 (BTullis)
[16:15:16] Data-Platform-SRE (2024.02.12 - 2024.03.03), Data-Platform: Investigate late/delayed Airflow task failure notifications - https://phabricator.wikimedia.org/T358205#9572444 (BTullis) Open→Resolved
[16:19:23] Data-Engineering, WMF-JobQueue, serviceops, Patch-For-Review, and 3 others: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745#9572462 (Clement_Goubert)
[16:31:58] Data-Platform-SRE (2024.02.12 - 2024.03.03), Patch-For-Review: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045#9572492 (BTullis) a:Stevemunene→BTullis I have scheduled a maintenance window for Monday 26th at 11:00 when @Stevemunene and I will migrate this role to...
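Note on the eventgate-main thread above (13:00 to 15:49): the concern raised is that the Node.js heap limit had not been raised along with the container's memory limit, and that raising it too far could push the container into OOM. A minimal sketch of that sizing trade-off follows. It is not the content of the Gerrit patch; the function names, the 25% headroom, and the 400 MiB container limit in the usage line are hypothetical values chosen for illustration. The only real interface used is Node's `--max-old-space-size` flag, which takes a size in MiB.

```python
def max_old_space_size_mib(container_limit_mib: int, headroom_percent: int = 25) -> int:
    """Suggest a V8 old-space limit (MiB) that leaves headroom inside a container.

    headroom_percent is an assumed reserve for non-heap memory (buffers,
    native allocations, stacks); it is not a value taken from the log.
    """
    if container_limit_mib <= 0:
        raise ValueError("container_limit_mib must be positive")
    if not 0 <= headroom_percent < 100:
        raise ValueError("headroom_percent must be in [0, 100)")
    # Integer arithmetic avoids floating-point rounding surprises.
    return container_limit_mib * (100 - headroom_percent) // 100


def node_heap_flag(container_limit_mib: int, headroom_percent: int = 25) -> str:
    """Format the result as a Node.js command-line flag."""
    size = max_old_space_size_mib(container_limit_mib, headroom_percent)
    return f"--max-old-space-size={size}"


if __name__ == "__main__":
    # Hypothetical 400 MiB container limit, purely for illustration.
    print(node_heap_flag(400))
```

Keeping the heap limit tied to the container limit by a formula, rather than setting the two independently, is one way to avoid the drift described at 13:01:42.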
[16:43:41] Data-Engineering, Data-Platform-SRE (2024.02.12 - 2024.03.03): wmf.webrequest: 'presto error: Corrupted statistics for column "[user_agent] optional binary " in Parquet file ...' - https://phabricator.wikimedia.org/T320926#9572550 (BTullis) Open→Resolved a:BTullis Thanks for checking @mpopov...
[17:08:36] Data-Platform-SRE (2024.02.12 - 2024.03.03), Patch-For-Review: Bring stat1010 into service with GPU from stat1005 - https://phabricator.wikimedia.org/T336040#9572577 (BTullis) a:Stevemunene→BTullis Moving back to in-progress because we haven't moved the GPU. I have sent an email to analytics@li...
[17:49:51] Data-Engineering, Data-Platform-SRE (2024.02.12 - 2024.03.03): [superset k8s] Update the wikitech page with our production readiness checklist - https://phabricator.wikimedia.org/T356486#9572732 (BTullis) I suggest that we actually move these checklists to here: https://wikitech.wikimedia.org/wiki/Analy...
[17:51:46] Data-Platform-SRE (2024.02.12 - 2024.03.03): Archive /home/ezachte data on stat1007 - https://phabricator.wikimedia.org/T238243#9572736 (BTullis) a:BTullis
[17:52:23] Data-Platform-SRE (2024.02.12 - 2024.03.03), DC-Ops, SRE, ops-eqiad: Q#:rack/setup/install an-redacteddb1001 - https://phabricator.wikimedia.org/T355571#9572740 (BTullis)
[18:24:54] (PS3) Aleksandar Mastilovic: Add HQL file for CX report [analytics/refinery] - https://gerrit.wikimedia.org/r/1003928
[18:31:31] (CR) Aleksandar Mastilovic: "I've made changes that addressed some (most?) of your comments before actually seeing the comments." [analytics/refinery] - https://gerrit.wikimedia.org/r/1003928 (owner: Aleksandar Mastilovic)
[19:05:14] (PS1) Sbisson: Wikistories contribution: declare dt field [schemas/event/secondary] - https://gerrit.wikimedia.org/r/1006060 (https://phabricator.wikimedia.org/T343183)
[20:57:42] (CR) Aqu: "A fct was partially renamed." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1003745 (https://phabricator.wikimedia.org/T356363) (owner: Joal)
[21:00:13] Data-Engineering, Scap: Can't deploy airflow-dags/research anymore - https://phabricator.wikimedia.org/T311336#9573392 (dancy)
[21:20:26] Data-Platform-SRE, Discovery-Search: Determine cause/fix cross-cluster search missing index errors - https://phabricator.wikimedia.org/T358389#9573485 (bking)
[22:12:36] Data-Engineering, Scap, Patch-For-Review: Can't deploy airflow-dags/research anymore - https://phabricator.wikimedia.org/T311336#9573628 (CodeReviewBot) dancy opened https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/224 Fix longstanding bug in git.next_deploy_tag()
[22:31:57] Data-Platform-SRE (2024.02.12 - 2024.03.03), Discovery-Search (Current work), Documentation, Patch-For-Review: Review wikitech:Search and write processes for k8s world - https://phabricator.wikimedia.org/T356303#9573650 (EBernhardson) >>! In T356303#9571334, @JMeybohm wrote: > Yesterday on IRC t...
[23:01:39] Data-Engineering, Scap, Patch-For-Review: Can't deploy airflow-dags/research anymore - https://phabricator.wikimedia.org/T311336#9573705 (CodeReviewBot) thcipriani merged https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/224 Fix longstanding bug in git.next_deploy_tag()