[02:22:28] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: ...
[02:22:28] <jinxer-wm>	 varnishkafka for instance cp3060:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-source=eventlogging&var-cp_cluster=cache_text&var-instance=cp3060:9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[06:22:28] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: ...
[06:22:28] <jinxer-wm>	 varnishkafka for instance cp3060:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-source=eventlogging&var-cp_cluster=cache_text&var-instance=cp3060:9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[06:53:22] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban, Product-Analytics, Superset: Error when updating dashboard - https://phabricator.wikimedia.org/T308441 (Pablo) Yes :) Copy-pasted from Slack:   > Having emojis on that dashboard is anything but a priority, thank you so much @Razzi for your quick and effe...
[07:11:37] <wikibugs>	 Analytics, Product-Analytics: No data from ptwikinews in event.mediawiki_mediasearch_interaction table - https://phabricator.wikimedia.org/T308815 (cchen)
[07:14:06] <wikibugs>	 Analytics, Product-Analytics, Structured-Data-Backlog: No data from ptwikinews in event.mediawiki_mediasearch_interaction table - https://phabricator.wikimedia.org/T308815 (cchen)
[07:17:32] <wikibugs>	 Data-Engineering, Event-Platform, Generated Data Platform, Patch-For-Review: [Shared Event Platform] Ability to use Event Platform streams in Flink without boilerplate - https://phabricator.wikimedia.org/T308356 (JAllemandou) There is work in hive on that front but AFAICS it's for version 3+: htt...
[08:47:10] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban, Cassandra, Patch-For-Review: Enable Cassandra encryption (inter-node & client) - https://phabricator.wikimedia.org/T307798 (BTullis) I have generated the certificates and keys on the puppetmaster.  As suggested I took a verbatim copy of https://gerrit.wi...
[08:55:51] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban, Cassandra, Patch-For-Review: Enable Cassandra encryption (inter-node & client) - https://phabricator.wikimedia.org/T307798 (BTullis) I believe that https://gerrit.wikimedia.org/r/c/operations/puppet/+/791663 is now ready for deployment, so I'm happy to d...
[09:22:13] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: ...
[09:22:13] <jinxer-wm>	 varnishkafka for instance cp3060:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-source=eventlogging&var-cp_cluster=cache_text&var-instance=cp3060:9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:57:46] <wikibugs>	 Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, SRE Observability, and 2 others: Migrate analytics cluster alerts from Icinga to AlertManager - https://phabricator.wikimedia.org/T293399 (BTullis) Having reviewed this with the Data Engineering team, we would like to create separate ta...
[12:09:28] <wikibugs>	 Quarry, good first task: Quarry, unable to run tests following the README.md - https://phabricator.wikimedia.org/T308493 (rook)
[12:09:45] <wikibugs>	 Quarry, good first task: Escape special characters in results - https://phabricator.wikimedia.org/T308362 (rook)
[12:10:02] <wikibugs>	 Quarry, Regression, good first task: Bad resultset number case is not handled - https://phabricator.wikimedia.org/T218470 (rook)
[12:10:27] <wikibugs>	 Quarry, Documentation, good first task: Landing page for Quarry - https://phabricator.wikimedia.org/T308783 (rook)
[12:10:44] <wikibugs>	 Quarry, good first task: Define in a single place the pseudoname of unnamed queries - https://phabricator.wikimedia.org/T197029 (rook)
[12:21:09] <mforns>	 heya joal :] I'm looking at the Airflow alert, it seems the wikidata_json_to_hive job has failed because of OOM errors in the workers. The input data is 68.8 GB, and the executor_memory = 4GB. Although the dynamic-allocation is enabled and spark.dynamicAllocation.maxExecutors = 64, it seems it needs more memory...
[12:22:03] <mforns>	 I want to increase the memory, but my question would be if it's recommended to stick to powers of 2 -> executor_memory=8GB, or that would be too much?
[12:22:18] <mforns>	 whatcha think?
[12:23:08] <joal>	 heya mforns - Indeed I noticed this too when testing spark3 with this job - we should change both the wikidata_json_to_hive and common
[12:23:26] <joal>	 Going to 8g per executor solved the problem, so let's have that
[12:23:36] <joal>	 thanks a lot for looking into that mforns :)
[12:36:02] <mforns>	 ok, will change and deploy
[12:41:31] <joal>	 <3
[12:41:43] <joal>	 mforns: I can review if you wish
[12:41:54] <mforns>	 joal: I already merged :P
[12:42:08] <mforns>	 deploying
[12:42:12] <mforns>	 sorry for that
[12:42:46] <joal>	 np thanks mforns :)
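The memory change agreed above (4 GB to 8 GB per executor, with dynamic allocation capped at 64 executors) can be sketched as plain Spark properties. This is only an illustration: the property names are standard Spark settings, but the helper functions are hypothetical and how the wikidata_json_to_hive DAG actually passes its configuration is not shown in this log. The computed figure is an upper bound on combined executor heap, not a statement about why the job ran out of memory.

```python
def spark_conf(executor_memory_gb: int, max_executors: int = 64) -> dict:
    """Build the Spark properties mentioned in the chat as a plain dict.

    Hypothetical helper for illustration; not taken from the real DAG code.
    """
    return {
        "spark.executor.memory": f"{executor_memory_gb}g",
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


def max_total_executor_memory_gb(conf: dict) -> int:
    """Upper bound on combined executor heap if every executor is allocated."""
    per_executor = int(conf["spark.executor.memory"].rstrip("g"))
    return per_executor * int(conf["spark.dynamicAllocation.maxExecutors"])


old_conf = spark_conf(4)  # setting that hit OOM errors
new_conf = spark_conf(8)  # setting joal reported as working

print(max_total_executor_memory_gb(old_conf))  # 256
print(max_total_executor_memory_gb(new_conf))  # 512
```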
[13:16:09] <mforns>	 joal: I'm having problems with the deploy :((( Actually I think it's not the deployment, it went fine, but Airflow is not updating the DAG code! I checked in the production DAG folder, and the code is updated. But the UI is still showing the old code... after 30 minutes!
[13:16:25] <joal>	 MEH :(
[13:16:59] <mforns>	 seems to me Airflow is stuck...
[13:18:25] <mforns>	 when I tried the code changes in the dev instance, they showed up fine, so I don't think it's a DAG interpretation issue either
[13:21:53] <mforns>	 I restarted Airflow, and it still shows the old code, and the sensors are not checking correctly I think, maybe a database connection thing?
[13:25:49] <joal>	 hm
[13:32:22] <joal>	 have you restarted all of airflow mforns ?
[13:32:59] <mforns>	 not sure, but I restarted the DAG file processing for sure, I can see it in the logs.
[13:33:16] <mforns>	 And since then, all DAGs except for 6, are not being read
[13:33:34] <mforns>	 oh. now 7
[13:33:47] <mforns>	 hm, it seems to be slowly catching up
[13:33:55] <mforns>	 but... this is weeiiirdd
[13:34:15] <joal>	 weird indeed!
[13:35:54] <joal>	 I'm looking at an-launcher1002 top - it's busy - not overwhelmed but busy
[13:36:38] <mforns>	 joal: how about an-coord1001?
[13:37:01] <mforns>	 is it 1001 or 1002 the mariadb one? looking
[13:37:12] <joal>	 an-coord1001 is fine
[13:37:49] <mforns>	 mysql in 1001 60% CPU, is that ok?
[13:39:32] <joal>	 mforns: the host is multi-core, so yes it's very ok - the host is not using its cores
[13:39:49] <mforns>	 ah I see
[13:48:36] <joal>	 hm
[13:48:57] <joal>	 There are many airflow processes on an-launcher
[13:49:11] <joal>	 it doesn't lack memory and CPU is not 100%
[13:49:34] <mforns>	 yes
[13:49:46] <mforns>	 I think I found something interesting here:
[13:50:23] <mforns>	  /srv/airflow-analytics/logs/dag_processor_manager/dag_processor_manager.log
[13:51:13] <mforns>	 It seems it only manages to read part of the DAGs, most dags don't get interpreted
[13:51:51] <mforns>	 the DAG parsing happens every 30 seconds, and the timeout for interpreting the whole dag_bag is IIUC 30 seconds.
[13:52:38] <joal>	 hm
[13:52:42] <mforns>	 It says DAGs take about 1 second or less to parse.
[13:52:45] <joal>	 How come it doesn't manage to do that?
[13:52:47] <joal>	 hm
[13:52:50] <mforns>	 yea
[13:52:53] <joal>	 probably not enough CPU left I assume
[13:53:04] <joal>	 There are not that many jobs!
[13:53:48] <mforns>	 no...
[13:53:49] <mforns>	 17
[13:54:42] <joal>	 there seem to be really a lot of threads/processes for airflow
[13:54:58] <joal>	 mforns: could it be that you have a test instance running there?
[13:55:05] <mforns>	 yes
[13:55:09] <wikibugs>	 Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, SRE Observability, and 2 others: Migrate analytics cluster alerts from Icinga to AlertManager - https://phabricator.wikimedia.org/T293399 (fgiunchedi) >>! In T293399#7944047, @BTullis wrote: > Having reviewed this with the Data Engineer...
[13:55:10] <joal>	 would you stop it?
[13:56:03] <joal>	 mforns: many local executors currently running on the machine!
[13:56:47] <mforns>	 ok, killed the dev instance and all my airflow threads
[13:59:01] <mforns>	 oh! joal, all dags are parsed now!
[13:59:07] <joal>	 bingo
[13:59:25] <mforns>	 code is up to date
[13:59:32] <mforns>	 oh my gooood
[13:59:36] <joal>	 There are other things I'm sure (host too busy) - but it's a bad idea to use the production machine as a test env
[13:59:49] <joal>	 mforns: I'm gonna go to razzi's session :)
[13:59:53] <mforns>	 obvious... bad marcel!
[14:00:04] <mforns>	 👍
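The diagnosis above came down to noticing that a second (dev) Airflow instance was running on the production host. A minimal sketch of that check, assuming text shaped like the output of `ps -eo user,args`: count processes whose command line mentions airflow, grouped by user. The function name and the sample data are hypothetical and do not come from an-launcher1002.

```python
from collections import Counter


def count_airflow_processes(ps_output: str) -> Counter:
    """Count processes whose command line mentions 'airflow', per user.

    Expects a header line followed by one process per line, with the
    user name in the first whitespace-separated column.
    """
    counts = Counter()
    for line in ps_output.splitlines()[1:]:  # skip the header line
        parts = line.split(None, 1)  # [user, command line]
        if len(parts) == 2 and "airflow" in parts[1]:
            counts[parts[0]] += 1
    return counts


# Hypothetical sample; user names and commands are made up for illustration.
SAMPLE_PS_OUTPUT = """\
USER      COMMAND
analytics airflow scheduler
analytics airflow webserver
someuser  airflow scheduler
someuser  airflow webserver
root      sshd
"""

counts = count_airflow_processes(SAMPLE_PS_OUTPUT)
for user, n in sorted(counts.items()):
    print(user, n)
```

More than one user owning scheduler processes on the same host would be a hint that a test instance is competing with production for CPU.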
[14:01:49] <wikibugs>	 Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, SRE Observability, and 2 others: Migrate analytics cluster alerts from Icinga to AlertManager - https://phabricator.wikimedia.org/T293399 (BTullis) Thanks @fgiunchedi - I'll create those follow-ups.  > May I ask what's the difference on...
[14:23:35] <wikibugs>	 Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, SRE Observability, and 2 others: Migrate analytics cluster alerts from Icinga to AlertManager - https://phabricator.wikimedia.org/T293399 (fgiunchedi) >>! In T293399#7944787, @BTullis wrote: > Thanks @fgiunchedi - I'll create those foll...
[14:24:15] <wikibugs>	 Analytics, Product-Analytics, SDAW-MediaSearch, Structured-Data-Backlog: No data from ptwikinews in event.mediawiki_mediasearch_interaction table - https://phabricator.wikimedia.org/T308815 (CBogen)
[18:39:43] <awight>	 Is there public access to Kafka, or should third-party apps use EventStreams instead?
[18:50:42] <awight>	 Never mind—I was hoping to use an existing Kafka client but looking more closely I see that it doesn't support at-least-once processing out of the box so I might as well just keep improving the SSE client I've started already.
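One way to get at-least-once processing from an SSE stream is to record an event's id only after its handler has finished, and to resend the last recorded id (the standard Last-Event-ID header) when reconnecting. The sketch below shows only the parsing and checkpointing part over an iterable of lines. It is not awight's client; `parse_sse`, `consume`, `handle` and `commit` are hypothetical names, and whether a given server resumes from Last-Event-ID is an assumption to verify.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple


def _field_value(line: str, prefix: str) -> str:
    """Return the value after 'prefix', dropping one optional leading space."""
    value = line[len(prefix):]
    return value[1:] if value.startswith(" ") else value


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[Optional[str], str]]:
    """Yield (last_event_id, data) pairs; a blank line dispatches an event."""
    event_id: Optional[str] = None  # persists across events, as in SSE
    data = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data:
                yield event_id, "\n".join(data)
            data = []
        elif line.startswith("id:"):
            event_id = _field_value(line, "id:")
        elif line.startswith("data:"):
            data.append(_field_value(line, "data:"))
        # Comments and other fields are ignored in this sketch.


def consume(
    lines: Iterable[str],
    handle: Callable[[str], None],
    commit: Callable[[str], None],
) -> None:
    """Run handle() on each event; commit its id only after handle() returns.

    If handle() raises, the id is not committed, so a reconnect that sends
    the last committed id would receive the failed event again.
    """
    for event_id, data in parse_sse(lines):
        handle(data)
        if event_id is not None:
            commit(event_id)


# Usage with an in-memory stream standing in for a network response.
stream = ["id: 1\n", "data: a\n", "\n", "id: 2\n", "data: b\n", "\n"]
seen, committed = [], []
consume(stream, seen.append, committed.append)
print(seen, committed)
```

Committing after handling means an event can be delivered twice after a crash, so handlers should tolerate duplicates.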
[22:33:24] <wikibugs>	 Analytics, LDAP-Access-Requests, SRE: Grant Access to `wmf` for `Dmantena` - https://phabricator.wikimedia.org/T308294 (Dzahn) I think we should escalate this directly to the analytics team for advice how to move forward. Let me add them.
[22:33:28] <wikibugs>	 Analytics, LDAP-Access-Requests, SRE: Grant Access to `wmf` for `Dmantena` - https://phabricator.wikimedia.org/T308294 (Dzahn)
[22:33:39] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban, Product-Analytics, Superset: Help with data that's not appearing on charts - https://phabricator.wikimedia.org/T301895 (Mayakp.wiki) Thanks @BTullis . We are currently working on updating Active Editor charts T307143 and will change the chart type as a p...
[23:47:15] <wikibugs>	 Analytics, Data-Engineering-Radar, Event-Platform, Metrics-Platform, Browser-Support-Microsoft-Edge: Problem with delay caused by intake-analytics.wikimedia.org - https://phabricator.wikimedia.org/T295427 (Downsize43) Not sure what table of mine you used for your test. The problem remains  on...