[02:22:28] (VarnishkafkaNoMessages) firing: ...
[02:22:28] varnishkafka for instance cp3060:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-source=eventlogging&var-cp_cluster=cache_text&var-instance=cp3060:9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[06:22:28] (VarnishkafkaNoMessages) firing: ...
[06:22:28] varnishkafka for instance cp3060:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-source=eventlogging&var-cp_cluster=cache_text&var-instance=cp3060:9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[06:53:22] Data-Engineering, Data-Engineering-Kanban, Product-Analytics, Superset: Error when updating dashboard - https://phabricator.wikimedia.org/T308441 (Pablo) Yes :) Copy-pasted from Slack: > Having emojis on that dashboard is anything but a priority, thank you so much @Razzi for your quick and effe...
[07:11:37] Analytics, Product-Analytics: No data from ptwikinews in event.mediawiki_mediasearch_interaction table - https://phabricator.wikimedia.org/T308815 (cchen)
[07:14:06] Analytics, Product-Analytics, Structured-Data-Backlog: No data from ptwikinews in event.mediawiki_mediasearch_interaction table - https://phabricator.wikimedia.org/T308815 (cchen)
[07:17:32] Data-Engineering, Event-Platform, Generated Data Platform, Patch-For-Review: [Shared Event Platform] Ability to use Event Platform streams in Flink without boilerplate - https://phabricator.wikimedia.org/T308356 (JAllemandou) There is work in hive on that front but AFAICS it's for version 3+: htt...
[08:47:10] Data-Engineering, Data-Engineering-Kanban, Cassandra, Patch-For-Review: Enable Cassandra encryption (inter-node & client) - https://phabricator.wikimedia.org/T307798 (BTullis) I have generated the certificates and keys on the puppetmaster. As suggested I took a verbatim copy of https://gerrit.wi...
[08:55:51] Data-Engineering, Data-Engineering-Kanban, Cassandra, Patch-For-Review: Enable Cassandra encryption (inter-node & client) - https://phabricator.wikimedia.org/T307798 (BTullis) I believe that https://gerrit.wikimedia.org/r/c/operations/puppet/+/791663 is now ready for deployment, so I'm happy to d...
[09:22:13] (VarnishkafkaNoMessages) resolved: ...
[09:22:13] varnishkafka for instance cp3060:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-source=eventlogging&var-cp_cluster=cache_text&var-instance=cp3060:9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:57:46] Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, SRE Observability, and 2 others: Migrate analytics cluster alerts from Icinga to AlertManager - https://phabricator.wikimedia.org/T293399 (BTullis) Having reviewed this with the Data Engineering team, we would like to create separate ta...
[12:09:28] Quarry, good first task: Quarry, unable to run tests following the README.md - https://phabricator.wikimedia.org/T308493 (rook)
[12:09:45] Quarry, good first task: Escape special characters in results - https://phabricator.wikimedia.org/T308362 (rook)
[12:10:02] Quarry, Regression, good first task: Bad resultset number case is not handled - https://phabricator.wikimedia.org/T218470 (rook)
[12:10:27] Quarry, Documentation, good first task: Landing page for Quarry - https://phabricator.wikimedia.org/T308783 (rook)
[12:10:44] Quarry, good first task: Define in a single place the pseudoname of unnamed queries - https://phabricator.wikimedia.org/T197029 (rook)
[12:21:09] heya joal :] I'm looking at the Airflow alert, it seems the wikidata_json_to_hive job has failed because of OOM errors in the workers. The input data is 68.8 GB, and the executor_memory = 4GB. Although dynamic allocation is enabled and spark.dynamicAllocation.maxExecutors = 64, it seems it needs more memory...
[12:22:03] I want to increase the memory, but my question would be if it's recommended to stick to powers of 2 -> executor_memory=8GB, or would that be too much?
[12:22:18] whatcha think?
[12:23:08] heya mforns - Indeed I noticed this too when testing spark3 with this job - we should change both the wikidata_json_to_hive and common
[12:23:26] Going to 8g per executor solved the problem, so let's have that
[12:23:36] thanks a lot for looking into that mforns :)
[12:36:02] ok, will change and deploy
[12:41:31] <3
[12:41:43] mforns: I can review if you wish
[12:41:54] joal: I already merged :P
[12:42:08] deploying
[12:42:12] sorry for that
[12:42:46] np thanks mforns :)
[13:16:09] joal: I'm having problems with the deploy :((( Actually I think it's not the deployment, it went fine, but Airflow is not updating the DAG code! I checked in the production DAG folder, and the code is updated. But the UI is still showing the old code... after 30 minutes!
[13:16:25] MEH :(
[13:16:59] seems to me Airflow is stuck...
[13:18:25] when I tried the code changes in the dev instance, they showed up fine, so I don't think it's a DAG interpretation issue either
[13:21:53] I restarted Airflow, and it still shows the old code, and the sensors are not checking correctly I think, maybe a database connection thing?
[13:25:49] hm
[13:32:22] have you restarted all of airflow mforns ?
[13:32:59] not sure, but I restarted the DAG file processing for sure, I can see it in the logs.
[13:33:16] And since then, all DAGs except for 6 are not being read
[13:33:34] oh. now 7
[13:33:47] hm, it seems to be slowly catching up
[13:33:55] but... this is weeiiirdd
[13:34:15] weird indeed!
[13:35:54] I'm looking at an-launcher1002 top - it's busy - not overwhelmed but busy
[13:36:38] joal: how about an-coord1001?
[13:37:01] is it 1001 or 1002 the mariadb one? looking
[13:37:12] an-coord1001 is fine
[13:37:49] mysql in 1001 60% CPU, is that ok?
[13:39:32] mforns: the host is multi-core, so yes it's very ok - the host is not using its cores
[13:39:49] ah I see
[13:48:36] hm
[13:48:57] There are many airflow processes on an-launcher
[13:49:11] it doesn't lack memory and CPU is not 100%
[13:49:34] yes
[13:49:46] I think I found something interesting here:
[13:50:23] /srv/airflow-analytics/logs/dag_processor_manager/dag_processor_manager.log
[13:51:13] It seems it only manages to read part of the DAGs, most dags don't get interpreted
[13:51:51] the DAG parsing happens every 30 seconds, and the timeout for interpreting the whole dag_bag is IIUC 30 seconds.
[13:52:38] hm
[13:52:42] It says DAGs take about 1 second or less to parse.
[13:52:45] How come it doesn't manage to do that?
[13:52:47] hm
[13:52:50] yea
[13:52:53] probably not enough CPU left I assume
[13:53:04] There are not that many jobs!
[13:53:48] no...
[13:53:49] 17
[13:54:42] there seem to be really a lot of threads/processes for airflow
[13:54:58] mforns: could it be that you have a test instance running there?
[13:55:05] yes
[13:55:09] Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, SRE Observability, and 2 others: Migrate analytics cluster alerts from Icinga to AlertManager - https://phabricator.wikimedia.org/T293399 (fgiunchedi) >>! In T293399#7944047, @BTullis wrote: > Having reviewed this with the Data Engineer...
[13:55:10] would you stop it?
[13:56:03] mforns: many local executors currently running on the machine!
[13:56:47] ok, killed the dev instance and all my airflow threads
[13:59:01] oh! joal, all dags are parsed now!
[13:59:07] bingo
[13:59:25] code is up to date
[13:59:32] oh my gooood
[13:59:36] There are other things I'm sure (host too busy) - but it's a bad idea to use the production machine as a test env
[13:59:49] mforns: I'm gonna go to razzi's session :)
[13:59:53] obvious... bad marcel!
[14:00:04] 👍
[14:01:49] Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, SRE Observability, and 2 others: Migrate analytics cluster alerts from Icinga to AlertManager - https://phabricator.wikimedia.org/T293399 (BTullis) Thanks @fgiunchedi - I'll create those follow-ups. > May I ask what's the difference on...
[14:23:35] Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, SRE Observability, and 2 others: Migrate analytics cluster alerts from Icinga to AlertManager - https://phabricator.wikimedia.org/T293399 (fgiunchedi) >>! In T293399#7944787, @BTullis wrote: > Thanks @fgiunchedi - I'll create those foll...
[14:24:15] Analytics, Product-Analytics, SDAW-MediaSearch, Structured-Data-Backlog: No data from ptwikinews in event.mediawiki_mediasearch_interaction table - https://phabricator.wikimedia.org/T308815 (CBogen)
[18:39:43] Is there public access to Kafka, or should third-party apps use EventStreams instead?
[18:50:42] Never mind - I was hoping to use an existing Kafka client but looking more closely I see that it doesn't support at-least-once processing out of the box so I might as well just keep improving the SSE client I've started already.
[22:33:24] Analytics, LDAP-Access-Requests, SRE: Grant Access to `wmf` for `Dmantena` - https://phabricator.wikimedia.org/T308294 (Dzahn) I think we should escalate this directly to the analytics team for advice how to move forward. Let me add them.
[22:33:28] Analytics, LDAP-Access-Requests, SRE: Grant Access to `wmf` for `Dmantena` - https://phabricator.wikimedia.org/T308294 (Dzahn)
[22:33:39] Data-Engineering, Data-Engineering-Kanban, Product-Analytics, Superset: Help with data that's not appearing on charts - https://phabricator.wikimedia.org/T301895 (Mayakp.wiki) Thanks @BTullis . We are currently working on updating Active Editor charts T307143 and will change the chart type as a p...
[23:47:15] Analytics, Data-Engineering-Radar, Event-Platform, Metrics-Platform, Browser-Support-Microsoft-Edge: Problem with delay caused by intake-analytics.wikimedia.org - https://phabricator.wikimedia.org/T295427 (Downsize43) Not sure what table of mine you used for your test. The problem remains on...