[02:57:32] <wikibugs>	 10Data-Engineering, 10Product-Analytics, 10wmfdata-python: Update anaconda-wmf's wmfdata-python to 1.4.0 - https://phabricator.wikimedia.org/T305067 (10nshahquinn-wmf)
[06:37:30] <joal>	 !log Kill skein test jobs in arn
[06:37:31] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:44:06] <joal>	 Good morning dcausse - sorry for the early ping
[06:44:49] <joal>	 dcausse: there is a bunch of old mjolnir jobs still up in the cluster - can I kill them?
[06:50:39] <joal>	 !log Kill rerun stuck oozie job
[06:50:40] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:53:55] <joal>	 !log killing old mjolnit jobs
[06:53:56] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:03:26] <joal>	 actually I can't kill the mjolnir jobs
[07:13:12] <joal>	 btullis: good morning - I could with some help when you get in please
[07:19:49] <joal>	 looks like something has happened yesterday at 19:30 on an-coord1001
[07:20:07] <joal>	 CPU usage pattern has very much changed
[07:23:06] <joal>	 And it's as if all our hive-server queries were stuck
[07:25:59] <joal>	 ok I think it's the HiveServer failing  - I can run a query using hive, but not beeline
[07:28:09] <joal>	 !log Restart HiveServer2 on an-coord1001 (I didn't even know I could do this)
[07:28:10] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:32:05] <joal>	 !log restart failed oozie jobs
[07:32:06] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:46:27] <joal>	 ok we're back in track, webrequest jobs will be catching up and the flow will restarts
[08:22:08] <wikibugs>	 10Data-Engineering-Planning, 10Data Pipelines, 10Epic: Migrate all Cassandra Jobs - https://phabricator.wikimedia.org/T309995 (10JAllemandou) Shall this be resolved @EChetty ?
[08:43:42] <btullis>	 joal: I'm on the case now. Apologies for the delay. How can I best help?
[08:44:17] <joal>	 btullis: nothing urgent, the cluster is back in track I think
[08:44:48] <btullis>	 OK, cool. I'm looking into the additional sudoers group request from your email.
[08:45:00] <joal>	 thanks a lot, that will help :)
[08:49:43] <elukey>	 btullis: o/
[08:49:59] <btullis>	 Hello 
[08:50:18] <elukey>	 I noticed something weird yesterday while talking to Joseph, on the dse-k8s nodes I don't see the extra HDDs with lsblk or similar
[08:50:23] <elukey>	 I see only the SDDs
[08:50:44] <elukey>	 I quickly skimmed the procurement tasks and IIRC we didn't change the specs from ml-serve nodes
[08:50:47] <elukey>	 or did we?
[08:51:52] <elukey>	 IIRC https://phabricator.wikimedia.org/T286594 was the first procurement task for the nodes
[08:52:05] <btullis>	 Hmm. I don't remember the specific requirement for HDDs, personally. Checking now.
[08:52:51] <elukey>	 I mean we didn't really care for those since all the containers are running currently on SSDs, but if we want to use local disk for spark etc.. the HDDs may come handy
[08:53:03] <elukey>	 (they were 2x2TB in theory)
[08:53:27] <btullis>	 OK, so the first 4 nodes has 2 x 2TB drives. Let me check the second set of four.
[08:54:01] <elukey>	 how did you check?
[08:54:56] <btullis>	 Sorry, just checked the PDFs of the quote. Yep, definitely not mentioned on the second set of four servers. https://phabricator.wikimedia.org/T303432
[08:56:20] <elukey>	 ah snap yes in the second batch no mention of it, not even in the packing slip
[08:57:11] <btullis>	 If it's a big deal, why don't we see about moving some drives from an-presto servers to the dse-k8s-workers? Plenty of drives doing nothing in these boxes.
[08:57:36] <elukey>	 ah nono in the ml case it is super fine not having them, we don't need local storage on HDDs 
[08:57:42] <elukey>	 or we don't need a lot of it
[08:58:12] <elukey>	 but it was weird to see the difference
[08:58:42] <elukey>	 also for 1001-1004 we should have 8 disks of 2TBs, but maybe they are not shown because we need to configure them via PERC or similar
[08:59:19] <btullis>	 Oh, so you can't see the drives on 100[1-4] - Oh right, checking now.
[09:00:33] <elukey>	 but yeah if we don't have the extra drives on 1005-1008 it doesn't matter :D
[09:00:48] <btullis>	 Yeah, they're present. You can see them with `sudo megacli -PDList -aall` but they're not configured.
[09:01:03] <elukey>	 ahhh right didn't check via megacli, makes sense
[09:01:29] <elukey>	 okok they are there, but we'll likely not even consider them
[09:08:41] <wikibugs>	 10Data-Engineering: Grant analytics-admins the right to run commands as the yarn user - https://phabricator.wikimedia.org/T321378 (10BTullis)
[09:10:19] <wikibugs>	 10Data-Engineering, 10Patch-For-Review, 10Shared-Data-Infrastructure (Sprint 03): Grant analytics-admins the right to run commands as the yarn user - https://phabricator.wikimedia.org/T321378 (10BTullis) p:05Triage→03Medium
[09:10:22] <wikibugs>	 10Data-Engineering, 10Patch-For-Review, 10Shared-Data-Infrastructure (Sprint 03): Grant analytics-admins the right to run commands as the yarn user - https://phabricator.wikimedia.org/T321378 (10BTullis)
[09:11:18] <elukey>	 btullis: when you have a moment https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/845430
[09:11:54] <wikibugs>	 10Data-Engineering, 10Patch-For-Review, 10Shared-Data-Infrastructure (Sprint 03): Grant analytics-admins the right to run commands as the yarn user - https://phabricator.wikimedia.org/T321378 (10BTullis) Expediting this ticket in order to help with cluster performance troubleshooting.
[09:16:26] <elukey>	 btullis: ah snap sorry didn't see the other one :( I can amend mine, I thought to do both since I saw your comment on #machine-learning :(
[09:17:57] <btullis>	 It's all good. I didn't think to add the ml-staging cluster, so I'm more than happy that you did. 
[09:18:09] <elukey>	 ack thanks, going to merge then!
[09:18:38] <btullis>	 👍
[09:19:39] <btullis>	 joal: Here is the sudoers change for analytics-admins -> yarn that you requested. https://gerrit.wikimedia.org/r/c/operations/puppet/+/845462
[09:22:20] <joal>	 thanks a lot btullis :)
[09:51:28] <joal>	 actually ignore my first message about datahub ingestion - the problem was with the HiveServer
[09:53:16] <btullis>	 OK, good to know. We still don't actually know what caused the problem with the HiveServer, do we? You've mentioned noisy SASL messages in the logs, but are there any other clues yet?
[09:56:53] <wikibugs>	 (03PS7) 10Joal: Update mediawiki-history page and user computation [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/842922 (https://phabricator.wikimedia.org/T318589)
[09:58:12] <elukey>	 completely ignorant about the issue, but in https://grafana.wikimedia.org/d/000000379/hive?orgId=1&var-instance=an-coord1001&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=analytics&from=now-12h&to=now the GC timings look not great
[09:58:45] <elukey>	 (I see an event around 7:30 UTC, not sure if it is the same)
[09:59:39] <joal>	 elukey: it starts at the moment our jobs start not being able to access hive
[09:59:51] <joal>	 elukey: Nothing special I could find in the logs
[10:00:06] <elukey>	 joal: was it around 7:30 UTC? 
[10:00:16] <joal>	 elukey: 19:30 UTC
[10:00:38] <elukey>	 ah yes syre
[10:00:39] <joal>	 at 7:30 UTC today I restarted the server, which solved the problem
[10:00:39] <elukey>	 https://grafana.wikimedia.org/d/000000379/hive?orgId=1&var-instance=an-coord1001&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=analytics&from=now-24h&to=now
[10:00:47] <elukey>	 yep yep
[10:01:03] <elukey>	 gc time for old gen went up to more than a min
[10:01:06] <joal>	 I really wonder what have happened
[10:01:26] <elukey>	 so the oldgen was filled with garbage
[10:02:28] <joal>	 interesting elukey - the oldgen usage grows steadily from spet 27th
[10:03:09] <joal>	 feels like we have a memor leak or something
[10:03:16] <joal>	 cause the pattern repeats
[10:03:20] <icinga-wm>	 PROBLEM - Check unit status of check_webrequest_partitions on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit check_webrequest_partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:04:13] <joal>	 but, the old-gen space being full had not previsouly triggered crazy GC
[10:04:19] <joal>	 Only this time
[10:04:20] <btullis>	 We have this ticket, which is about a gradual memory leak in Hive, but this doesn't feel like the same pattern, does it? https://phabricator.wikimedia.org/T303168
[10:04:22] <joal>	 Weird
[10:05:04] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_webrequest_partitions.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:05:07] <joal>	 I think it could very weel be related btullis 
[10:07:23] <joal>	 the problem of the check_webrequest_partition is related to the cluster being late refining webrequest it'll be back to normal in a few hours
[10:51:57] <wikibugs>	 (03PS1) 10Matthias Mullie: Add schema for Extension:SearchVue actions [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/845498 (https://phabricator.wikimedia.org/T321069)
[10:52:30] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add schema for Extension:SearchVue actions [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/845498 (https://phabricator.wikimedia.org/T321069) (owner: 10Matthias Mullie)
[10:52:36] <wikibugs>	 (03CR) 10Matthias Mullie: [C: 04-2] "DNM; awaiting confirmation for a couple of details." [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/845498 (https://phabricator.wikimedia.org/T321069) (owner: 10Matthias Mullie)
[11:18:22] <wikibugs>	 (03PS2) 10Matthias Mullie: Add schema for Extension:SearchVue actions [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/845498 (https://phabricator.wikimedia.org/T321069)
[11:19:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add schema for Extension:SearchVue actions [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/845498 (https://phabricator.wikimedia.org/T321069) (owner: 10Matthias Mullie)
[11:41:33] <wikibugs>	 10Data-Engineering, 10CheckUser, 10MW-1.38-notes (1.38.0-wmf.26; 2022-03-14), 10MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), and 3 others: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 (10Dreamy_Jazz)
[11:43:32] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Event-Platform Value Stream, and 3 others: NavigationTiming Extension schemas Event Platform Migration - https://phabricator.wikimedia.org/T271208 (10Peter)
[12:30:36] <wikibugs>	 (03PS8) 10Joal: Update mediawiki-history page and user computation [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/842922 (https://phabricator.wikimedia.org/T318589)
[13:36:23] <wikibugs>	 10Data-Engineering, 10CheckUser, 10MW-1.38-notes (1.38.0-wmf.26; 2022-03-14), 10MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), and 3 others: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 (10Dreamy_Jazz)
[13:36:43] <wikibugs>	 10Data-Engineering, 10CheckUser, 10MW-1.38-notes (1.38.0-wmf.26; 2022-03-14), 10MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), and 3 others: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 (10Dreamy_Jazz)
[13:37:20] <wikibugs>	 10Data-Engineering, 10CheckUser, 10MW-1.38-notes (1.38.0-wmf.26; 2022-03-14), 10MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), and 3 others: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 (10Dreamy_Jazz)
[13:41:43] <wikibugs>	 10Data-Engineering, 10Data-Catalog, 10Product-Analytics: Propagate field descriptions from event schemas to metastore - https://phabricator.wikimedia.org/T307040 (10Ottomata) This task is about the `event` tables in Hive, for which most fields are indeed created from the event schemas, but not all. If we pro...
[14:00:28] <wikibugs>	 10Data-Engineering, 10CheckUser, 10MW-1.38-notes (1.38.0-wmf.26; 2022-03-14), 10MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), and 3 others: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 (10Ladsgroup) @Dreamy_Jazz Please wait for a bit. Running multiple data migration...
[14:11:27] <wikibugs>	 10Data-Engineering-Planning, 10DC-Ops, 10SRE, 10Shared-Data-Infrastructure, 10ops-eqiad: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10Cmjohnson) The dns has been updated but I am not getting any mgmt connection, I need to check to make sure the mgmt cables are conne...
[14:23:12] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: varnishkafka on cp4037 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4037%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:26:05] <wikibugs>	 10Data-Engineering, 10CheckUser, 10MW-1.38-notes (1.38.0-wmf.26; 2022-03-14), 10MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), and 3 others: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 (10Dreamy_Jazz) There is no particular rush on the subtasks to do with the checku...
[14:28:12] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: varnishkafka on cp4037 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4037%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:08:12] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: varnishkafka on cp4049 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4049%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:13:08] <joal>	 elukey, btullis same pattern again in hive-server, starting at 13:38 :(
[15:13:12] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: varnishkafka on cp4049 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4049%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:13:36] <joal>	 And with that, stuck jobs
[15:14:00] <btullis>	 joal: Oh no. I'm in a meeting right now, but I can jump out if critical now.
[15:14:15] <joal>	 all good btullis - I'll manage for now
[15:14:49] <btullis>	 Thank you. I'm here if needed.
[15:18:02] <elukey>	 :(
[15:18:05] <joal>	 !log restart hive-server2 service
[15:18:07] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:20:12] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: varnishkafka on cp4049 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4049%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:20:20] <wikibugs>	 10Data-Engineering-Planning, 10Data Pipelines (Sprint 03): [airflow] Normalize the use of timeouts in Airflow DAGs - https://phabricator.wikimedia.org/T317549 (10mforns) We had a discussion with the team and here's a summary:  **tl;dr** We won't use timeouts for now.  ---  Use cases of timeouts --- We identifi...
[15:21:12] <elukey>	 joal: next time maybe let's get a thread dump or a heap dump so we know what's happening
[15:25:12] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: varnishkafka on cp4049 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4049%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:28:12] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: varnishkafka on cp4049 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4049%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:32:32] <wikibugs>	 10Data-Engineering-Planning, 10Data Pipelines (Sprint 03): [airflow] Normalize the use of timeouts in Airflow DAGs - https://phabricator.wikimedia.org/T317549 (10JArguello-WMF) @mforns Can you please let me know when you add the decision to the Airflow developer guide? So I can close the ticket. Thanks!
[15:33:12] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: varnishkafka on cp4049 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4049%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:38:56] <wikibugs>	 10Data-Engineering-Planning, 10Data Pipelines (Sprint 03): [airflow] Normalize the use of timeouts in Airflow DAGs - https://phabricator.wikimedia.org/T317549 (10mforns) This is the MR that removes the timeout from the existing DAGs: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requ...
[15:45:43] <wikibugs>	 10Data-Engineering, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for igwikiquote - https://phabricator.wikimedia.org/T314639 (10Ladsgroup) DBA side is done
[15:46:35] <wikibugs>	 10Data-Engineering, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for bclwikiquote - https://phabricator.wikimedia.org/T316456 (10Ladsgroup) DBA side is done
[15:47:07] <wikibugs>	 10Data-Engineering, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for tlwikiquote - https://phabricator.wikimedia.org/T317111 (10Ladsgroup) DBA side is done
[15:47:44] <wikibugs>	 10Data-Engineering, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for bnwikiquote - https://phabricator.wikimedia.org/T319190 (10Ladsgroup) DBA side is done
[15:50:45] <wikibugs>	 10Data-Engineering-Planning, 10Data Pipelines (Sprint 03): [airflow] Normalize the use of timeouts in Airflow DAGs - https://phabricator.wikimedia.org/T317549 (10mforns) Here's the documentation about timeouts in Airflow's developer guide: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow/Developer...
[16:07:45] <wikibugs>	 10Data-Engineering-Planning, 10Data-Catalog, 10Event-Platform Value Stream: Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (10odimitrijevic)
[16:09:14] <wikibugs>	 10Data-Engineering, 10Data-Catalog, 10Product-Analytics: Propagate field descriptions from event schemas to metastore - https://phabricator.wikimedia.org/T307040 (10odimitrijevic) @Ottomata makes sense. Thanks for posting the ticket
[16:22:12] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: varnishkafka on cp4047 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4047%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[16:27:12] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: varnishkafka on cp4047 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4047%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[16:28:07] <sukhe>	 ^ this is fine, host is not pooled
[16:38:12] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: varnishkafka on cp4047 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4047%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[16:38:28] <joal>	 jobs are still stuc :(
[16:38:36] <joal>	 there is a problem of locking
[16:39:24] <joal>	 I think there is a job that locks Hive mysql DB tables, and doesn't release them, and this blocks all other jobs
[16:40:52] <btullis>	 sukhe: Thanks for the update. It's still a problem that the check is supposed to take account of de-pooled hosts. However, we're still getting these false positives. 
[16:41:46] <btullis>	 joal: Do you want to look together? Maybe we can see which processes might have locks open on the mariadb servers or something.
[16:42:01] <joal>	 btullis: why not I don't really know how to go with that
[16:43:12] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: varnishkafka on cp4047 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4047%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[16:49:48] <btullis>	 !log restarting hue on an-tool1009
[16:49:49] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:55:15] <btullis>	 !log restarting hive-server2 service on an-coord1001
[16:55:16] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:57:25] <wikibugs>	 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10odimitrijevic) @Cmjohnson Thank you!
[17:20:47] <joal>	 gone for diner - will be back after to double check
[18:40:12] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: varnishkafka on cp4047 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4047%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[18:45:12] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: varnishkafka on cp4047 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4047%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[18:54:16] <wikibugs>	 10Analytics-Wikistats, 10Data-Engineering-Planning, 10Data Pipelines: [Wikistats] Add newly translated languages - https://phabricator.wikimedia.org/T311315 (10Milimetric) Thanks @Aftabuzzaman, I didn't know about Bengali.  Releasing a new language is a manual process at the moment.  I'm building and deployi...
[18:54:43] <wikibugs>	 (03PS1) 10Milimetric: Enable Bengali and deploy 2.9.8 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/845670
[18:57:26] <wikibugs>	 (03Abandoned) 10Milimetric: Enable Bengali and deploy 2.9.8 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/845670 (owner: 10Milimetric)
[18:58:00] <wikibugs>	 10Analytics-Jupyter, 10Data-Engineering-Planning, 10Product-Analytics, 10Patch-For-Review: Change puppet jupyterhub module to point to conda-analytics - https://phabricator.wikimedia.org/T321088 (10xcollazo) Attached gerrit patch was tested and debugged today on `stat1007`. Thanks @Ottomata for the help!...
[18:58:29] <wikibugs>	 10Analytics-Jupyter, 10Data-Engineering-Planning, 10Product-Analytics, 10Patch-For-Review: Change puppet jupyterhub module to point to conda-analytics - https://phabricator.wikimedia.org/T321088 (10xcollazo) 05Open→03In progress
[18:58:31] <wikibugs>	 10Analytics-Jupyter, 10Data-Engineering, 10Product-Analytics: Replace anaconda-wmf with smaller, non-stacked Conda environments - https://phabricator.wikimedia.org/T302819 (10xcollazo)
[19:01:34] <wikibugs>	 10Analytics-Jupyter, 10Data-Engineering, 10Product-Analytics, 10Data Pipelines (Sprint 03), 10Patch-For-Review: Change puppet jupyterhub module to point to conda-analytics - https://phabricator.wikimedia.org/T321088 (10xcollazo)
[19:02:00] <wikibugs>	 10Analytics-Jupyter, 10Data-Engineering, 10Product-Analytics, 10Data Pipelines (Sprint 03), 10Patch-For-Review: Change puppet jupyterhub module to point to conda-analytics - https://phabricator.wikimedia.org/T321088 (10xcollazo)
[19:02:08] <wikibugs>	 10Analytics-Jupyter, 10Data-Engineering, 10Product-Analytics, 10Data Pipelines (Sprint 03), 10Patch-For-Review: Change puppet jupyterhub module to point to conda-analytics - https://phabricator.wikimedia.org/T321088 (10xcollazo) a:03xcollazo
[19:12:32] <wikibugs>	 (03PS1) 10Milimetric: Release 2.9.8 with Bengali support [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/845676
[19:12:49] <wikibugs>	 (03CR) 10Milimetric: [C: 03+2] Release 2.9.8 with Bengali support [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/845676 (owner: 10Milimetric)
[19:15:00] <wikibugs>	 (03Merged) 10jenkins-bot: Release 2.9.8 with Bengali support [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/845676 (owner: 10Milimetric)
[19:26:13] <wikibugs>	 10Analytics-Wikistats, 10Data-Engineering-Planning, 10Data Pipelines: [Wikistats] Add newly translated languages - https://phabricator.wikimedia.org/T311315 (10Milimetric) Ok, looks good, please check and let me know.  If any other language is ready, just file a task and let us know.  We'd have no way of kno...
[20:08:28] <wikibugs>	 10Analytics-Jupyter, 10Data-Engineering, 10Product-Analytics, 10Data Pipelines (Sprint 03), 10Patch-For-Review: Add support for jupyterhub on conda-analytics - https://phabricator.wikimedia.org/T321088 (10xcollazo)
[20:15:47] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream (Sprint 03), 10Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (10Ottomata) @daniel, Q about page suppress vs delete.    As far as I can tell, a page delete is a delete,...
[20:20:09] <wikibugs>	 10Data-Engineering-Planning, 10Data Pipelines (Sprint 03): [airflow] Normalize the use of timeouts in Airflow DAGs - https://phabricator.wikimedia.org/T317549 (10xcollazo) For clarification: Is this decision just for the `analytics` instance, or for all instances?
[20:30:43] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream, 10MediaWiki-Core-Hooks: Add $comment to ArticleRevisionVisibilitySet params - https://phabricator.wikimedia.org/T321411 (10Ottomata)
[20:31:52] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream, 10MediaWiki-Core-Hooks: Add $comment to ArticleRevisionVisibilitySet params - https://phabricator.wikimedia.org/T321411 (10Ottomata) I'm not sure if providing the reason to the hook for a revision suppression is a security concern.  It is provided during ful...
[20:35:22] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream, 10MediaWiki-Core-Hooks: Add $comment to ArticleRevisionVisibilitySet params - https://phabricator.wikimedia.org/T321411 (10Ottomata) Actually, having the LogEntry passed to the ArticleRevisionVisibilitySet might be nice, because then we could use the LogEntr...
[20:42:27] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream, 10MediaWiki-Core-Hooks: Create PageUndeleteComplete hook, analogous to PageDeleteComplete - https://phabricator.wikimedia.org/T321412 (10Ottomata)
[21:47:29] <wikibugs>	 10Analytics-Wikistats, 10Data-Engineering-Planning, 10Data Pipelines: [Wikistats] Add newly translated languages - https://phabricator.wikimedia.org/T311315 (10Aftabuzzaman) Thanks :)