[08:25:04] Hello! We have received a lot of SLA alerts. e.g. AQS is waiting for this partition: wmf.webrequest/webrequest_source=text/year=2023/month=1/day=8/hour=7
[08:25:04] I'm investigating.
[08:27:24] aqu: o/ same here, some of our jobs waiting for events in /mnt/hdfs/wmf/data/event/ stopped functioning on 2023-01-07T17:00:00 (waiting for this partition to exist)
[08:29:15] seems like all event streams stopped being refined around 2023-01-07T17:00:00
[08:34:19] kafka -> hdfs seems to work, I see data in my stream partitions under hdfs:///wmf/data/raw/event/
[08:45:47] Yes, I'm looking for a refine error. And a way to relaunch the process.
[08:49:32] Back online!
[08:52:24] aqu: thanks for looking into it! :)
[09:07:32] The refine_event job running since 1/7 https://yarn.wikimedia.org/cluster/app/application_1663082229270_682638
[09:18:22] So I'm about to `sudo systemctl kill refine_event.service` on an-launcher, because it's stuck doing nothing I think. Then I will restart it.
[09:19:06] Hello joal !
[09:39:08] !log Manually kill the Spark process on an-launcher1002 `sudo -u analytics kill -9 28538`
[09:39:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:40:46] `systemctl status refine_event.service` is now in inactive mode. It's weird. I would have expected it in failed mode, as one of its processes was killed.
[09:44:54] Data-Engineering-Planning, Event-Platform Value Stream, SRE-OnFire, serviceops, and 2 others: Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (JMeybohm) >>! In T324994#8463619, @Clement_Goubert wrote: > We have the resources to keep it at 30 replic...
[09:48:06] Data-Engineering-Planning, Event-Platform Value Stream, SRE-OnFire, serviceops, and 2 others: Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (Clement_Goubert) No worries, I took a look at the resources and it seemed fine to leave it like that. We...
[09:48:34] !log killing refine_event yarn application `sudo -u analytics yarn application -kill application_1663082229270_682638`
[09:48:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:50:24] I'm also here, in case I can be of help with anything.
[10:00:30] Hello btullis ! With Sandra, we are currently building the backfilling command with all tables.
[10:01:37] aqu: Great, thanks. Do feel free to ping me if I can help.
[10:18:02] Data-Engineering, Event-Platform Value Stream: [NEEDS GROOMING] Streaming services errors should be routed to an error event topic. - https://phabricator.wikimedia.org/T326536 (gmodena)
[10:21:53] !log backfilling with refine_event on an-launcher1002 `sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_event --ignore_failure_flag=true --since=2023-01-07T16:00:00 --until=2023-01-09T09:00:00 --verbose`
[10:21:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:26:59] Data-Engineering: an-worker1132 down - https://phabricator.wikimedia.org/T326459 (jcrespo) Removing SRE, as this has been so far correctly routed to #data-engineering , but please revert the #SRE tag with potentially a more concrete team one (e.g. #ops-eqiad ) for hardware maintenance after they (@BTullis?)...
[10:37:58] Data-Engineering-Planning, Event-Platform Value Stream: [EPIC] Streaming and event driven Python services - https://phabricator.wikimedia.org/T324689 (gmodena) >>! In T324689#8500218, @Ottomata wrote: > It took me forever to realize how this works for the page content enrich pipeline: the output is the s...
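Editor's note on the 10:21:53 backfill: to get a feel for how many hourly partitions a `--since`/`--until` window like that one spans, here is a minimal sketch. It is not part of refine_event. The function name `list_hourly_partitions` is hypothetical, the `year=/month=/day=/hour=` layout is borrowed from the webrequest partition quoted at 08:25:04 and may differ for event tables, treating `--until` as exclusive is an assumption, and the script relies on GNU `date`.

```shell
#!/bin/sh
# Hypothetical helper (not part of refine_event): print one partition suffix
# per hour in [since, until). Assumes GNU date and UTC timestamps.
list_hourly_partitions() {
  # $1 = since, $2 = until, e.g. 2023-01-07T16:00:00
  lhp_start=$(date -u -d "$(printf '%s' "$1" | tr 'T' ' ') UTC" +%s) || return 1
  lhp_end=$(date -u -d "$(printf '%s' "$2" | tr 'T' ' ') UTC" +%s) || return 1
  lhp_t=$lhp_start
  while [ "$lhp_t" -lt "$lhp_end" ]; do
    # %-m, %-d and %-H are GNU extensions that drop zero padding,
    # matching the unpadded month=1/day=8/hour=7 style seen in the log.
    date -u -d "@$lhp_t" +'year=%Y/month=%-m/day=%-d/hour=%-H'
    lhp_t=$((lhp_t + 3600))
  done
}

# Window taken from the logged backfill command.
list_hourly_partitions 2023-01-07T16:00:00 2023-01-09T09:00:00 | wc -l
```

Under the exclusive-end assumption, the window above covers 41 hours per table, which gives a rough idea of why the backfill was still running in the afternoon.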
[10:49:56] looks like there are only the webrequest issues to take care of
[10:50:21] thanks a lot aqu for killing-rerunning the refine job
[10:51:08] This is the second time this happens in a month - possibly we'll have to get to the root cause of that one (I don't wish we do, very time consuming I assume)
[10:56:12] and SandraEbele :)
[10:57:00] Ah! you were the only one logging aqu, and therefore I had not seen SandraEbele in the mix - thanks a lot SandraEbele :)
[10:57:24] our jobs resumed, thanks! :)
[10:58:28] \o/\
[11:17:28] Data-Engineering-Planning, Event-Platform Value Stream: [EPIC] Streaming and event driven Python services - https://phabricator.wikimedia.org/T324689 (gmodena) >>! In T324689#8503145, @Ottomata wrote: > Okay, I think I have something working? Terrific! LGTM. Maybe we can test it out within the scope o...
[11:34:02] Data-Engineering-Planning, Shared-Data-Infrastructure, Epic: Install Ceph Cluster for Data Engineering - https://phabricator.wikimedia.org/T324660 (EChetty) a:BTullis
[11:42:11] Data-Engineering, Equity-Landscape: Grants input metric - https://phabricator.wikimedia.org/T309276 (KCVelaga_WMF) The Affiliates part of grants is QAed here: https://docs.google.com/spreadsheets/d/1yx4x96407HT9fTq1KrQxB_ZChKK8bJ9_NKGRPqynNjA/edit?pli=1#gid=0&range=V2:Y128
[12:20:13] (VarnishkafkaNoMessages) firing: varnishkafka on cp5025 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5025%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[12:20:43] ^^ cp5025 got depooled to fix some purged issues
[12:20:47] (is pooled back again)
[12:24:26] vgutierrez: Ack, thanks. Still working towards eliminating these false positives with VarnishKafkaNoMessages but we're not there yet.
[12:25:13] (VarnishkafkaNoMessages) resolved: varnishkafka on cp5025 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5025%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[12:56:39] We have webrequest data loss for 20230108. And it seems like it’s mostly from the cp*.eqsin.wmnet hosts
[13:33:44] SandraEbele: eqsin had some issues yesterday
[14:20:08] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 06): Flink wrappers and helper libraries should be moved into a dedicated git repo with packaging and CI. - https://phabricator.wikimedia.org/T324746 (JArguello-WMF)
[14:20:10] Data-Engineering-Planning, serviceops, Event-Platform Value Stream (Sprint 06): k8s deployment-charts mesh module should allow use of mesh without public_port Service - https://phabricator.wikimedia.org/T326252 (JArguello-WMF)
[14:20:14] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 06): We should provide utilities for local development and unit testing of Python streaming services - https://phabricator.wikimedia.org/T324951 (JArguello-WMF)
[14:20:25] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 06): Flink Tables should have a default ROWTIME column. - https://phabricator.wikimedia.org/T324144 (JArguello-WMF)
[14:20:44] Data-Engineering-Planning, serviceops, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 06), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (JArguello-WMF)
[14:21:35] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 06): Flink Tables should have a default ROWTIME column. - https://phabricator.wikimedia.org/T324144 (JArguello-WMF) Open→Resolved
[14:25:50] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 06): Deploy Mediawiki Stream Enrichment on an-launcher1002. - https://phabricator.wikimedia.org/T323914 (JArguello-WMF)
[14:26:05] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 06): [SPIKE] Evaluate a pyflink version of Mediawiki Stream Enrichment - https://phabricator.wikimedia.org/T323217 (JArguello-WMF)
[14:28:04] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 06): Flink + Event Platform integration for writing into streams via Table API - https://phabricator.wikimedia.org/T324114 (JArguello-WMF)
[14:28:06] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 06): Spark Streaming Dumps POC: Backfill content table - https://phabricator.wikimedia.org/T323641 (JArguello-WMF)
[14:28:08] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 06): Spark Streaming Dumps POC: Update iceberg tables - https://phabricator.wikimedia.org/T323645 (JArguello-WMF)
[14:28:10] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 06): Spark Streaming Dumps POC: Backfill metadata table - https://phabricator.wikimedia.org/T323642 (JArguello-WMF)
[14:28:21] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 06): Flink SQL queries should access Kafka topics from a Catalog - https://phabricator.wikimedia.org/T322022 (JArguello-WMF)
[14:28:43] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 06), Patch-For-Review: Flink application and flink-kubernetes-operator production docker images - https://phabricator.wikimedia.org/T316519 (JArguello-WMF)
[14:29:07] Data-Engineering, Pageviews-API, Pageviews-Anomaly: Pageviews data dumps are not being created - https://phabricator.wikimedia.org/T326559 (Aquameta)
[14:48:25] !log reran webrequest failed jobs ‘sudo -u analytics kerberos-run-command analytics oozie job --oozie $OOZIE_URL -Dstart_time=2023-01-08T07:00Z -Dstop_time=2023-01-08T14:59Z -Dwebrequest_source=text -Derror_incomplete_data_threshold=100 -Dwarning_incomplete_data_threshold=100 -Derror_data_loss_threshold=100 -Dwarning_data_loss_threshold=100 -submit -config /home/ebysans/webrequest_text_coordinator.properties’
[14:48:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:50:01] o/ our jobs are stuck again waiting for refined partitions on the snapshot 2023-01-09T09:00:00
[14:50:59] aqu: ^
[15:01:58] dcausse, It's expected. We have relaunched refine_jobs for a lot of missing partitions from 1/7 till this morning UTC (1/9 8am). The hive tables you needed were refined at the beginning of the process, and we are waiting for the job to finish. Then we will catch up the last missing hours of today. And finally we will reactivate the systemd job.
[15:01:58] Are your missing partitions needed urgently?
[15:02:40] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 06): Flink wrappers and helper libraries should be moved into a dedicated git repo with packaging and CI. - https://phabricator.wikimedia.org/T324746 (JArguello-WMF) Open→Resolved
[15:02:42] Data-Engineering-Planning, Event-Platform Value Stream: [EPIC] Streaming and event driven Python services - https://phabricator.wikimedia.org/T324689 (JArguello-WMF)
[15:07:38] aqu: no this is not urgent, just wanted to raise this in case this was not expected, thanks for the info! :)
[15:10:03] Data-Engineering, Event-Platform Value Stream (Sprint 06), Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (lbowmaker)
[15:50:44] Data-Engineering-Planning, Event-Platform Value Stream: Gitlab CI pipeline for Pytthon applications should bundle Java eventutilities and runtime deps - https://phabricator.wikimedia.org/T326567 (gmodena)
[15:58:12] Data-Engineering-Planning: Document how to show your work in phabricator and/or elsewhere - https://phabricator.wikimedia.org/T324796 (odimitrijevic) Open→Resolved Thanks Andrew for the write up!
[16:06:31] Data-Engineering-Planning, Data Pipelines (Sprint 05-06), Patch-For-Review, SecTeam-Processed, Vuln-VulnComponent: Upgrade Puppet code to make Airflow configuration files compatible with version 2.3.4 - https://phabricator.wikimedia.org/T315580 (Antoine_Quhen) a:Antoine_Quhen→Stevemune...
[16:22:06] Analytics-Wikistats, Data-Engineering: Stats page - https://phabricator.wikimedia.org/T324993 (Pyb) You can use Wikiscan. Eg http://de.wikiscan.org/users There is a limitation for the English Wikipedia, they can't sort per year. Otherwise, there is every wiki
[16:38:39] Data-Engineering-Planning, Data Pipelines (Sprint 05-06), Patch-For-Review, SecTeam-Processed, Vuln-VulnComponent: Upgrade Puppet code to make Airflow configuration files compatible with version 2.3.4 - https://phabricator.wikimedia.org/T315580 (Ottomata) @Stevemunene This should be doable wi...
[17:02:58] @btullis o/ have you had any trouble with mvn ... -Djava.net.useSystemProxies when building spark image on build servers?
[17:04:53] ottomata: No, I don't think so. Oh, now I look I haven't started using it yet. I didn't submit my change: https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/864770
[17:05:28] it works fine for me everywhere i've used it, but apparently not on prod servers using docker-pkg
[17:05:44] the http_proxy env vars are def set.
[17:06:09] trying to work around by parsing http_proxy env vars into host and port (annoying cuz multiple : in proto in uri)
[17:08:31] (CR) Mforns: [C: +2] "I like this solution! You managed to implement it without adding parameters to the script :-)" [analytics/refinery] - https://gerrit.wikimedia.org/r/870971 (https://phabricator.wikimedia.org/T323614) (owner: Xcollazo)
[17:09:28] !log Relaunching refine_event after partial backfilling `sudo systemctl start refine_event.service` (an-launcher1002)
[17:09:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:09:47] ottomata: I can merge my change and try building it, if that might help you troubleshoot.
[17:10:11] i'm doing mine now
[17:10:22] haha crazy bash stuff to the rescue?
[17:10:31] ENV MVN_HTTP_PROXY_OPTION=${http_proxy:+"-Dhttp.proxyHost=${http_proxy%:*} -Dhttp.proxyPort=${http_proxy##*:}"}
[17:30:10] (CR) Xcollazo: Modify refinery-drop-older-than to support 'snapshot' partitions (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/870971 (https://phabricator.wikimedia.org/T323614) (owner: Xcollazo)
[18:14:33] (PS3) Xcollazo: Modify refinery-drop-older-than to support 'snapshot' partitions [analytics/refinery] - https://gerrit.wikimedia.org/r/870971 (https://phabricator.wikimedia.org/T323614)
[18:16:06] (CR) Xcollazo: Modify refinery-drop-older-than to support 'snapshot' partitions (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/870971 (https://phabricator.wikimedia.org/T323614) (owner: Xcollazo)
[18:18:56] joal: you still around? wanna discuss Airflow Druid loading?
[18:19:08] If too late, then tomorrow!
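Editor's note on the 17:06 to 17:10 exchange about splitting `http_proxy` into host and port: the "multiple : in proto in uri" problem comes from the colon in the scheme. With a value such as `http://webproxy.example:8080` (a made-up host), `${http_proxy%:*}` would appear to leave `http://webproxy.example` as the host. A minimal sketch of one way around that in plain POSIX shell follows. The function name `maven_proxy_opts` is hypothetical, and this is not the Dockerfile change that was actually built.

```shell
#!/bin/sh
# Hypothetical helper: turn an http_proxy style URL into Maven/JVM proxy flags.
# Prints nothing when the argument is empty, mirroring the ${var:+...} idea.
maven_proxy_opts() {
  mpo_url=${1:-}
  if [ -n "$mpo_url" ]; then
    mpo_hostport=${mpo_url#*://}        # drop the scheme ("http://")
    mpo_hostport=${mpo_hostport%%/*}    # drop any path after the authority
    mpo_hostport=${mpo_hostport##*@}    # drop optional user:password@
    case "$mpo_hostport" in
      *:*)
        # Host is everything before the last colon, port everything after it.
        printf '%s\n' "-Dhttp.proxyHost=${mpo_hostport%:*} -Dhttp.proxyPort=${mpo_hostport##*:}"
        ;;
      *)
        # No explicit port in the URL: only set the host.
        printf '%s\n' "-Dhttp.proxyHost=${mpo_hostport}"
        ;;
    esac
  fi
}

# Example with a made-up proxy address.
maven_proxy_opts "http://webproxy.example:8080"
```

`http.proxyHost` and `http.proxyPort` are the standard JVM networking properties. Bracketed IPv6 literals are not handled by this sketch.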
[18:25:02] hey mforns - sorry I missed your ping
[18:25:18] mforns: I'll be in meeting in 5 minutes - ok for you tomorrow afternoon?
[18:25:20] no problemo, please let's do tomorrow if too late!
[18:25:25] yes, ofc
[18:25:34] super :) thanks
[18:25:40] 👍
[18:33:17] Data-Engineering-Planning, Machine-Learning-Team, Research: Proposal: deprecate the mediawiki.revision-score stream in favour of more streams like mediawiki-revision-score- - https://phabricator.wikimedia.org/T317768 (diego) Hi all, we have a use case here T326179. These models are already hos...
[18:41:45] (CR) Ottomata: [C: +1] "One nit, but looks fine to me either way" [analytics/refinery] - https://gerrit.wikimedia.org/r/870971 (https://phabricator.wikimedia.org/T323614) (owner: Xcollazo)
[19:17:26] (CR) Xcollazo: Modify refinery-drop-older-than to support 'snapshot' partitions (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/870971 (https://phabricator.wikimedia.org/T323614) (owner: Xcollazo)
[19:31:03] Data-Engineering, AQS 2.0 Roadmap, API Platform (API Platform Roadmap), Epic, and 2 others: AQS 2.0: Unique Devices service - https://phabricator.wikimedia.org/T288298 (FGoodwin)
[19:32:30] Data-Engineering, Product-Analytics: Add editors_monthly data to Druid - https://phabricator.wikimedia.org/T256719 (kzimmerman) Open→Declined This work relating to editors is now under T307883
[20:12:17] (PS4) Xcollazo: Modify refinery-drop-older-than to support 'snapshot' partitions [analytics/refinery] - https://gerrit.wikimedia.org/r/870971 (https://phabricator.wikimedia.org/T323614)
[20:15:33] (CR) Xcollazo: Modify refinery-drop-older-than to support 'snapshot' partitions (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/870971 (https://phabricator.wikimedia.org/T323614) (owner: Xcollazo)
[20:34:27] (CR) Ottomata: [C: +1] Modify refinery-drop-older-than to support 'snapshot' partitions [analytics/refinery] - https://gerrit.wikimedia.org/r/870971 (https://phabricator.wikimedia.org/T323614) (owner: Xcollazo)
[20:34:37] (CR) Ottomata: [C: +1] Modify refinery-drop-older-than to support 'snapshot' partitions (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/870971 (https://phabricator.wikimedia.org/T323614) (owner: Xcollazo)
[20:40:50] (CR) Xcollazo: [V: +2] "Ok verifying as per unit tests, and I also ran this branch on stat1007 which shows the right output for a dry-run." [analytics/refinery] - https://gerrit.wikimedia.org/r/870971 (https://phabricator.wikimedia.org/T323614) (owner: Xcollazo)
[20:41:23] (CR) Xcollazo: [V: +2] Modify refinery-drop-older-than to support 'snapshot' partitions (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/870971 (https://phabricator.wikimedia.org/T323614) (owner: Xcollazo)
[20:53:46] (CR) Mforns: [C: +2] "LGTM! Thanks for all the changes!" [analytics/refinery] - https://gerrit.wikimedia.org/r/870971 (https://phabricator.wikimedia.org/T323614) (owner: Xcollazo)
[21:56:59] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 04): Prototype Flink job for content Dumps - https://phabricator.wikimedia.org/T320966 (Milimetric) note for myself: https://github.com/apache/iceberg/pull/6182/files is recent activity about supporting deletes in future Flink / Iceberg APIs
[22:50:35] PROBLEM - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[23:03:55] RECOVERY - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[23:12:38] Data-Engineering-Planning: Ingest feature Hive schema into datahub - https://phabricator.wikimedia.org/T326598 (odimitrijevic)
[23:12:56] Data-Engineering-Planning, Data-Catalog: Ingest feature Hive schema into datahub - https://phabricator.wikimedia.org/T326598 (odimitrijevic)
[23:44:20] Data-Engineering-Planning, Event-Platform Value Stream, MediaWiki-Core-Hooks, Patch-For-Review: Create PageUndeleteComplete hook, analogous to PageDeleteComplete - https://phabricator.wikimedia.org/T321412 (OwenRB) p:Triage→Low a:OwenRB I took a little go at this - any feedback is app...