[05:04:40] (PS1) Sharvaniharan: Add a required field in mobile_apps fragment Added a new required field in a new fragment that will be used only by the android app. We will maintain any future android related fields here. [schemas/event/secondary] - https://gerrit.wikimedia.org/r/766893
[05:05:59] (CR) jerkins-bot: [V: -1] Add a required field in mobile_apps fragment Added a new required field in a new fragment that will be used only by the android app. We will maintain any future android related fields here. [schemas/event/secondary] - https://gerrit.wikimedia.org/r/766893 (owner: Sharvaniharan)
[05:07:10] (CR) Sharvaniharan: "Made this Patch to take care of CR comments on this: https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/765671/3/jsonschema/analyt" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/766893 (owner: Sharvaniharan)
[05:44:24] (PS1) Sharvaniharan: Add a required field in mobile_apps fragment [schemas/event/secondary] - https://gerrit.wikimedia.org/r/766897
[05:47:39] (CR) jerkins-bot: [V: -1] Add a required field in mobile_apps fragment [schemas/event/secondary] - https://gerrit.wikimedia.org/r/766897 (owner: Sharvaniharan)
[05:59:38] (PS2) Sharvaniharan: Add a required field in mobile_apps fragment [schemas/event/secondary] - https://gerrit.wikimedia.org/r/766897
[06:00:21] (CR) jerkins-bot: [V: -1] Add a required field in mobile_apps fragment [schemas/event/secondary] - https://gerrit.wikimedia.org/r/766897 (owner: Sharvaniharan)
[06:10:19] (PS3) Sharvaniharan: Add a required field in mobile_apps fragment [schemas/event/secondary] - https://gerrit.wikimedia.org/r/766897
[06:30:31] (CR) Sharvaniharan: "Made this patch to address the CR comments on : https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/765671/1" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/766897 (owner: Sharvaniharan)
[06:38:54] good morning folks
[06:39:11] I saw an alert for an-coord1002, the /srv partition is getting filled
[06:39:41] an-coord1001 is also getting full but still in a green zone (~80%)
[06:39:58] I see a lot of binlog files under /srv/sqldata for analytics-meta
[06:40:10] (Abandoned) Sharvaniharan: Add a required field in mobile_apps fragment Added a new required field in a new fragment that will be used only by the android app. We will maintain any future android related fields here. [schemas/event/secondary] - https://gerrit.wikimedia.org/r/765671 (https://phabricator.wikimedia.org/T299239) (owner: Sharvaniharan)
[06:40:37] (Abandoned) Sharvaniharan: Add a required field in mobile_apps fragment Added a new required field in a new fragment that will be used only by the android app. We will maintain any future android related fields here. [schemas/event/secondary] - https://gerrit.wikimedia.org/r/766893 (owner: Sharvaniharan)
[06:43:02] and from a quick check in binlogs it seems airflow
[06:44:25] elukey@an-coord1002:/srv/sqldata$ sudo mysqlbinlog -vv --base64-output=DECODE-ROWS --skip-ssl analytics-meta-bin.001114 | grep "UPDATE sla_miss" | wc -l
[06:44:28] 1388685
[06:47:26] is airflow currently doing some kind of backfill?
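A minimal sketch for extending the 06:44 check across the whole set of binlogs, to confirm that the sla_miss UPDATEs dominate everywhere and not just in analytics-meta-bin.001114; same mysqlbinlog flags as above, while the filename glob is an assumption about the directory layout:

    cd /srv/sqldata
    for f in analytics-meta-bin.0*; do    # 0* matches the numbered binlogs and skips the .index file
        # -vv with DECODE-ROWS renders row-based events as commented pseudo-SQL, so grep can match them
        n=$(sudo mysqlbinlog -vv --base64-output=DECODE-ROWS --skip-ssl "$f" | grep -c "UPDATE sla_miss")
        echo "$f: $n"
    done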
[06:49:48] this is from a random binlog
[06:49:48] UPDATE sla_miss SET timestamp='2022-02-28 00:39:34.591222' WHERE sla_miss.task_id = 'should_alert' AND sla_miss.dag_id = 'traffic_distribution' AND sla_miss.execution_date = '2022-02-17 00:00:00'
[06:50:34] if I grep for "2022-02-17" when doing the mysqlbinlog dump I see the line repeated a lot of times
[06:51:15] it seems as if airflow keeps checking a series of days for its dags, updating the related mysql row accordingly
[06:51:29] so the table space used is not a problem, it is not inserting new data
[06:51:35] but the update is registered in the binlog
[06:51:40] that grows
[09:06:54] going to open a task about it
[09:10:55] Data-Engineering, Airflow, Epic, Platform Team Workboards (Image Suggestion API): Airflow collaborations - https://phabricator.wikimedia.org/T282033 (gmodena)
[09:13:25] joal: around?
[09:13:43] Have time to jump in https://meet.google.com/zfv-rnjk-xdo with me and tanny411
[09:16:55] gehel: I think he is afk at the moment
[09:17:06] yeah, that's what it looks like
[09:17:09] No emergency
[09:17:56] joal: for when you're back: we were discussing with tanny411 which repository to use to push her code. Having some place to share code would help do reviews and start organizing.
[09:18:21] Data-Engineering: The analytics-meta's binlogs are full of airflow sla-related UPDATE statements - https://phabricator.wikimedia.org/T302777 (elukey) p:Triage→High
[09:18:22] My assumption is that this should follow whatever practices exist on the Data Engineering side, so I'll let you tell us how it works!
[09:28:43] Analytics: Check home/HDFS leftovers of zpapierski - https://phabricator.wikimedia.org/T302779 (MoritzMuehlenhoff)
[09:48:54] !log elukey@stat1004:~$ sudo kill `pgrep -u zpapierski` (offboarded user, puppet broken on the host)
[09:48:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:02:19] Data-Engineering, Data-Engineering-Kanban: The analytics-meta's binlogs are full of airflow sla-related UPDATE statements - https://phabricator.wikimedia.org/T302777 (BTullis) a:BTullis
[10:09:18] Data-Engineering, Data-Engineering-Kanban: The analytics-meta's binlogs are full of airflow sla-related UPDATE statements - https://phabricator.wikimedia.org/T302777 (BTullis) Thanks @elukey. I'll take this task and start working on it now, but I get the feeling that it's going to be best to loop in all...
[10:10:59] elukey: Thanks for the rapid analysis on the Airflow/MariaDB incident. I will look into this now.
[10:12:54] btullis: np! just added a comment about /srv/an-coord1001-backup, it may be dropped to free space on 1002
[10:12:55] Data-Engineering, Data-Engineering-Kanban: The analytics-meta's binlogs are full of airflow sla-related UPDATE statements - https://phabricator.wikimedia.org/T302777 (elukey) Yep yep makes sense, for the moment we can simply free some space on an-coord1002 and purge some binary logs in case it is needed....
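The 06:51 reasoning, that table space stays flat while the binlog grows because every UPDATE is written to it, can be confirmed directly. A hedged sketch using standard MariaDB statements and coreutils, assuming the same host and datadir shown above:

    # Per-binlog sizes as the server reports them
    sudo mysql --skip-ssl -e "SHOW BINARY LOGS;"
    # On-disk footprint of the binlogs (total) versus the whole datadir
    sudo du -sch /srv/sqldata/analytics-meta-bin.0* | tail -n 1
    sudo du -sh /srv/sqldata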
[10:19:03] !log btullis@an-coord1002:/srv$ sudo rm -rf an-coord1001-backup/ (#T302777)
[10:19:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:19:06] T302777: The analytics-meta's binlogs are full of airflow sla-related UPDATE statements - https://phabricator.wikimedia.org/T302777
[10:20:43] Data-Engineering, Data-Engineering-Kanban, Airflow: The analytics-meta's binlogs are full of airflow sla-related UPDATE statements - https://phabricator.wikimedia.org/T302777 (BTullis) Yes, good thinking. I have cleared the `/srv/an-coord1001-backup` directory as suggested. ` btullis@an-coord1002:/sr...
[10:35:53] Data-Engineering, Data-Engineering-Kanban, Airflow: The analytics-meta's binlogs are full of airflow sla-related UPDATE statements - https://phabricator.wikimedia.org/T302777 (BTullis) We're currently keeping 14 days' worth of logs on an-coord1002. ` btullis@an-coord1002:/srv/sqldata$ grep expire /et...
[10:41:30] (PS29) Phuedx: Metrics Platform event schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: Jason Linehan)
[10:41:46] (CR) Phuedx: Metrics Platform event schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: Jason Linehan)
[10:43:54] Data-Engineering, Data-Engineering-Kanban, Airflow: The analytics-meta's binlogs are full of airflow sla-related UPDATE statements - https://phabricator.wikimedia.org/T302777 (BTullis) I'd estimate that the growth began about 12 days ago, according to this disk usage graph. {F34970967,width=60%} [[ht...
[10:51:53] Data-Engineering, Data-Engineering-Kanban, Airflow: The analytics-meta's binlogs are full of airflow sla-related UPDATE statements - https://phabricator.wikimedia.org/T302777 (BTullis) That doesn't leave us very long though. By a rough guess that extra 8GB of space would get us back to 100% full by m...
[10:57:14] Data-Engineering, Data-Engineering-Kanban, Airflow: The analytics-meta's binlogs are full of airflow sla-related UPDATE statements - https://phabricator.wikimedia.org/T302777 (BTullis) On an-coord1001 we would have more like 10 days to fix the issue, before we hit 100%. {F34970989,width=60%} [[...
[11:28:01] Data-Engineering, Data-Engineering-Kanban, Airflow: The analytics-meta's binlogs are full of airflow sla-related UPDATE statements - https://phabricator.wikimedia.org/T302777 (BTullis) I can confirm that the UPDATE sla_miss is still happening. It is coming from the `traffic_distribution` DAG. {F349710...
[11:37:44] Data-Engineering, Data-Engineering-Kanban, Airflow: The analytics-meta's binlogs are full of airflow sla-related UPDATE statements - https://phabricator.wikimedia.org/T302777 (BTullis) Scheduler logs for this job can be viewed in real time with the command on an-launcher1002: ` tail -f /srv/airflow-an...
[11:42:46] Data-Engineering, Data-Engineering-Kanban, Airflow: The analytics-meta's binlogs are full of airflow sla-related UPDATE statements - https://phabricator.wikimedia.org/T302777 (BTullis) Looking more closely at the runs in Airflow, it looks like there is a run from Feb 12th that is still running, 308 h...
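Given the 14-day expire_logs_days noted at 10:35, the standard ways to buy back space would be a one-off purge or a shorter retention window. A hedged sketch of both (plain MariaDB statements; the 7-day figure is illustrative, and logs still needed by any attached replica must not be purged):

    # One-off purge of everything older than a week
    sudo mysql --skip-ssl -e "PURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL 7 DAY);"
    # Shorter retention window; runtime-only, so mirror it in my.cnf to persist
    sudo mysql --skip-ssl -e "SET GLOBAL expire_logs_days = 7;"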
[12:14:40] Data-Engineering, Data-Engineering-Kanban, Airflow: The analytics-meta's binlogs are full of airflow sla-related UPDATE statements - https://phabricator.wikimedia.org/T302777 (BTullis) I'm not sure where to look next for a running job, if indeed there is one. I can tell from [[https://github.com/wiki...
[12:16:57] mforns: I could do with your view on T302777 whenever you're around please. How should I handle what appears to be a stuck Airflow job from Feb 12th?
[12:16:58] T302777: The analytics-meta's binlogs are full of airflow sla-related UPDATE statements - https://phabricator.wikimedia.org/T302777
[12:28:32] Hi gehel and tanny411 - I was teaching this morning
[12:30:51] joal: Would you be available tomorrow sometime?
[12:31:40] tanny411: would 14:00 my time work?
[12:36:41] joal: I can actually do after 17:00. Which now I realize is too late for you, plus it's Wednesday.
[12:36:41] Let's meet at our usual time on Thursday. No hurry. And thanks!
[12:37:29] we can do 19:00 my time tomorrow if you wish, it'll let us skip the Thursday one :)
[12:40:40] joal: Yes, works for me :)
[12:40:49] ok let's go :)
[12:40:53] thanks tanny411 :)
[13:08:26] Analytics, Data-Engineering, MediaWiki-extensions-EventLogging, QuickSurveys, and 3 others: QuickSurveys should show an error when response is blocked - https://phabricator.wikimedia.org/T256463 (thiemowmde)
[13:36:15] (CR) Ottomata: [C: +1] Metrics Platform event schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: Jason Linehan)
[13:37:23] (CR) Ottomata: Add a required field in mobile_apps fragment (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/766897 (owner: Sharvaniharan)
[13:37:53] joal: o/
[13:38:06] hi ottomata - in meeting - talk in a bit?
[13:38:14] before i dive into other things, can I help with the gobblin metrics in any way? (yes of course!)
[13:39:00] ottomata: no news on gobblin-metrics for now - I'd like to pick your brain on the POC I've done with Flink please
[13:39:13] oh ho okay
[13:39:34] joal: let's talk gobblin metrics too, i might be able to find some time to work on it, if i may
[13:39:41] sure!
[14:01:27] ottomata: now?
[14:02:41] ya!
[14:05:58] PROBLEM - statsv Varnishkafka log producer on cp6010 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[14:06:53] (PS1) Joal: [WIP] Add prometheus metrics reporter [analytics/gobblin-wmf] - https://gerrit.wikimedia.org/r/767178
[14:07:45] RECOVERY - statsv Varnishkafka log producer on cp6010 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[14:09:46] (CR) jerkins-bot: [V: -1] [WIP] Add prometheus metrics reporter [analytics/gobblin-wmf] - https://gerrit.wikimedia.org/r/767178 (owner: Joal)
[14:15:19] (PS2) Joal: [WIP] Add prometheus metrics reporter [analytics/gobblin-wmf] - https://gerrit.wikimedia.org/r/767178 (https://phabricator.wikimedia.org/T294420)
[14:19:51] btullis: hi
[14:20:03] Hi mforns.
[14:20:21] I read the task, and it makes sense. I'm currently working on a fix for that job, but in the meantime we can deactivate it
[14:20:36] btullis: I just did
[14:20:51] Cool. Did you do it from the Airflow UI?
[14:20:54] yes
[14:21:36] I also cleared the dag_run that was active, so it stops trying to run
[14:22:09] OK, got it. Thanks. I was wondering whether it would be better to mark it as failed, or to clear it.
[14:22:36] (CR) jerkins-bot: [V: -1] [WIP] Add prometheus metrics reporter [analytics/gobblin-wmf] - https://gerrit.wikimedia.org/r/767178 (https://phabricator.wikimedia.org/T294420) (owner: Joal)
[14:22:38] when we re-enable that job, I'll re-run everything that failed
[14:23:39] btullis: is there something else I can do? With the logs config or anything?
[14:25:49] I think that in terms of the incident we're OK. I can see that the number of UPDATEs has dropped right down, so this isn't going to threaten the disk space any more.
[14:25:51] https://usercontent.irccloud-cdn.com/file/Hjsi1atg/image.png
[14:27:07] Just wondering whether there's anything that ought to be tweaked in terms of the SLA check frequency (approx 1 second), or anything else we could do to avoid this kind of issue in future.
[14:28:44] +1 to --^, checking every second seems to be very aggressive
[14:29:04] oh yea, was checking now
[14:29:28] This was causing about 80 database writes per second, which isn't huge by any means though. So maybe everything is fine. Maybe we just need to make allowance for what happens when we get a job with a lot of blocked runs in front of it.
[14:31:16] Data-Engineering, Data-Engineering-Kanban, Airflow: The analytics-meta's binlogs are full of airflow sla-related UPDATE statements - https://phabricator.wikimedia.org/T302777 (BTullis) @mforns has cleared the job, which has caused the SLA miss write storm to abate. {F34971284, width=60%} We're now in...
[14:37:22] joal: i think i'm going to need some more help when you have time, i can see how to get coding, but not how to run or test
[14:37:30] i'd like to make some tests locally to see if it can just be run
[14:37:35] without having to run in hadoop
[14:38:06] dunno if that is going to be possible
[14:38:31] btullis: I have not found any reference to the sla_miss update interval in the airflow docs, or on the internet.
[14:42:26] the only config offered by airflow is to switch the SLA checks on or off
[14:47:56] mforns: Yes I see. Oh well, nothing else to do, I suppose. Unless you can think of any way to improve it. Should we have seen that this job needed clearing before this?
[14:48:09] joal: want to talk api endpoint for WiViVi data?
[14:48:55] btullis: yes, definitely, we should have stopped the job as soon as we saw the error, my bad
[14:55:14] No, not your bad I don't think. Just shared learnings along the way.
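For reference, the UI actions described above have CLI equivalents, and the on/off switch mentioned at 14:42 is Airflow's check_slas option. A hedged sketch (Airflow 2.x command names; the DAG id comes from the binlog entries earlier, while running these as the right user on the right host is an assumption about this particular deployment):

    # Pause the DAG (same effect as the UI toggle)
    airflow dags pause traffic_distribution
    # List its runs, e.g. to spot one stuck since Feb 12th
    airflow dags list-runs -d traffic_distribution --state running
    # The on/off switch for SLA checking: [core] check_slas in airflow.cfg,
    # or the equivalent environment variable
    export AIRFLOW__CORE__CHECK_SLAS=False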
[14:55:18] :-)
[15:35:43] (CR) Bearloga: [C: -1] "- Queries should be formatted according to https://www.mediawiki.org/wiki/Product_Analytics/Style_guide#SQL" [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/766196 (https://phabricator.wikimedia.org/T295332) (owner: Mayakpwiki)
[16:02:54] Analytics: Check home/HDFS leftovers of ema - https://phabricator.wikimedia.org/T302815 (MoritzMuehlenhoff)
[16:11:36] I see nothing on the deploy etherpad, so skipping this week: https://etherpad.wikimedia.org/p/analytics-weekly-train
[17:00:30] a-team: standup
[17:00:43] joining in a minute (browser restart)
[17:14:30] Data-Engineering-Kanban, Data-Catalog: datahubsearch nodes alerting with "Rate of JVM GC Old generation-s runs" - https://phabricator.wikimedia.org/T302818 (razzi)
[17:16:08] Data-Engineering, Product-Analytics: Consider not using anaconda as base conda environment - https://phabricator.wikimedia.org/T302819 (Ottomata)
[17:28:30] Data-Engineering-Kanban, Data-Catalog, Patch-For-Review: Set up opensearch cluster for datahub - https://phabricator.wikimedia.org/T301382 (BTullis) Looks good, but I think that we need to sort out the firewall between these hosts. ` btullis@datahubsearch1001:/etc/opensearch/datahub$ curl http://127...
[17:35:54] (CR) Tchanders: [C: +2] "Tested again as outlined on PS42 - looks good" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415) (owner: AGueyte)
[17:36:36] (Merged) jenkins-bot: Basic ipinfo instrument setup [schemas/event/secondary] - https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415) (owner: AGueyte)
[17:48:51] Data-Engineering, Product-Analytics: Consider not using anaconda as base conda environment - https://phabricator.wikimedia.org/T302819 (mpopov) Gathered some stats on Python package importing from all users' readable notebooks across stat1004–stat1008: |package | n_users| n_notebooks| |--------...
[17:57:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[18:06:58] Data-Engineering, Product-Analytics: Consider not using anaconda as base conda environment - https://phabricator.wikimedia.org/T302819 (Ottomata) Awesome thank you! I wonder how many of these are importing from anaconda-wmf, and how many are importing from the user conda envs (e.g., new versions have be...
[18:17:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[18:28:01] Analytics-Kanban, Data-Engineering: Move the Analytics/DE testing infrastructure to Pontoon - https://phabricator.wikimedia.org/T292388 (BTullis) Moving back to the backlog for now.
[18:34:22] !log demo irc logging to data eng team members
[18:34:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:36:18] Heya milimetric - available for a chat?
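One way to answer the 18:06 question (base anaconda-wmf versus user conda envs) for any given package is to ask Python where an import actually resolves. A hedged sketch; numpy is just an example package, and the env paths are assumptions about how these hosts are laid out:

    # Run with the user's conda env active on a stat host
    python -c "import numpy; print(numpy.__version__, numpy.__file__)"
    # A __file__ under the user's env (e.g. ~/.conda/envs/...) means their env shadows
    # the base; a path under the anaconda-wmf install means the base env is in use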
[18:38:02] !log sandra testing
[18:38:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:38:45] ottomata: quick chat?
[18:48:12] Data-Engineering, Product-Analytics: kerberos::systemd_timer should have a smarter default for syslog_identifier - https://phabricator.wikimedia.org/T302533 (ldelench_wmf) p:Triage→Low
[18:57:54] Data-Engineering, Product-Analytics: kerberos::systemd_timer should have a smarter default for syslog_identifier - https://phabricator.wikimedia.org/T302533 (Mayakp.wiki) The issue we faced due to syslog_identifier not having `$title` as the default value was resolved in T295733. Hence, this task has bee...
[19:15:00] gone for tonight team
[19:33:56] mgharf... airflow cli overrides only work to trigger a single DAG run, not to start a permanent DAG... :(
[19:37:42] sorry jo was afk, next time
[19:38:34] joal: ah sorry!
[19:38:36] i missed your ping!
[19:38:39] soryryYyy
[19:54:13] mforns: when do you have some time for me & airflow docs
[19:59:29] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[20:00:24] milimetric: hi! now?
[20:00:38] yeah mforns if you're still working omw cave
[20:00:43] ok, omw
[20:10:03] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[20:12:39] PROBLEM - cache_text: Varnishkafka webrequest Delivery Errors per second -drmrs- on alert1001 is CRITICAL: 5.35 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=drmrs+prometheus/ops&var-source=webrequest&var-cp_cluster=cache_text&var-instance=All
[20:20:47] PROBLEM - cache_upload: Varnishkafka webrequest Delivery Errors per second -drmrs- on alert1001 is CRITICAL: 6.083 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=drmrs+prometheus/ops&var-source=webrequest&var-cp_cluster=cache_upload&var-instance=All
[20:45:25] PROBLEM - cache_text: Varnishkafka webrequest Delivery Errors per second -drmrs- on alert1001 is CRITICAL: 5.083 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=drmrs+prometheus/ops&var-source=webrequest&var-cp_cluster=cache_text&var-instance=All
[21:04:23] PROBLEM - cache_text: Varnishkafka webrequest Delivery Errors per second -drmrs- on alert1001 is CRITICAL: 5.233 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=drmrs+prometheus/ops&var-source=webrequest&var-cp_cluster=cache_text&var-instance=All
[21:19:26] Data-Engineering, Product-Analytics: Consider not using anaconda as base conda environment - https://phabricator.wikimedia.org/T302819 (mpopov) @Ottomata it occurred to
me that for the purposes of finding out what's likely to be shipped to worker nodes a more useful list would perhaps be a list of import...
[21:37:01] Data-Engineering, Product-Analytics: Consider not using anaconda as base conda environment - https://phabricator.wikimedia.org/T302819 (Ottomata) Right, but when those imports happen, how many are importing from e.g. more recent versions installed into the user's conda env, vs from the base anaconda-wmf...
[22:00:15] Data-Engineering, Data-Engineering-Kanban, Airflow: Use arrow_hdfs:// fsspec protocol in workflow_utils artifact syncing - https://phabricator.wikimedia.org/T300876 (Ottomata)
[22:03:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[22:08:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[23:08:15] (EventgateLoggingExternalLatency) firing: Elevated latency for POST events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[23:13:15] (EventgateLoggingExternalLatency) resolved: Elevated latency for POST events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
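Closing the loop on the 19:33 remark above: `airflow dags trigger` accepts a --conf JSON blob, but it applies only to the single run it creates, which is why the CLI cannot stand in for a permanently overridden DAG. A hedged sketch (Airflow 2.x; the conf key is purely hypothetical):

    # One-off run with an override; the conf applies to this run only
    airflow dags trigger traffic_distribution --conf '{"some_override": "value"}'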