[01:44:02] PROBLEM - cache_text: Varnishkafka webrequest Delivery Errors per second -drmrs- on alert1001 is CRITICAL: 5.617 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=drmrs+prometheus/ops&var-source=webrequest&var-cp_cluster=cache_text&var-instance=All
[03:50:54] PROBLEM - cache_text: Varnishkafka webrequest Delivery Errors per second -drmrs- on alert1001 is CRITICAL: 5.667 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=drmrs+prometheus/ops&var-source=webrequest&var-cp_cluster=cache_text&var-instance=All
[04:19:22] PROBLEM - cache_text: Varnishkafka webrequest Delivery Errors per second -drmrs- on alert1001 is CRITICAL: 5.617 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=drmrs+prometheus/ops&var-source=webrequest&var-cp_cluster=cache_text&var-instance=All
[06:44:46] good morning folks, these are for the Marseille DC, which has some networking issues (and isn't live yet)
[08:09:00] ack elukey - thanks for the update - we'll triple-check our webrequest alerts with that in mind :)
[08:40:53] joal: good morning
[08:41:23] Hi elukey :)
[08:48:11] Morning all. I'm back from my hols.
[08:48:28] Hi btullis - welcome back :)
[08:48:42] btullis: How was time off?
[08:48:55] I'm going to fix that hive JVM heap error that keeps firing on the test cluster.
[08:49:03] thank you btullis
[08:50:07] joal: Well, as with any holiday with young kids, absolutely exhausting. Engineered specifically to drive you to the edge of insanity.
[08:50:16] But still good fun.
[08:50:20] I can relate to that :)
[09:23:19] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:34:23] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:58:29] PROBLEM - cache_upload: Varnishkafka webrequest Delivery Errors per second -drmrs- on alert1001 is CRITICAL: 5.222 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=drmrs+prometheus/ops&var-source=webrequest&var-cp_cluster=cache_upload&var-instance=All
[11:16:35] lol ben, I can vaguely remember what vacations without kids were, freedom is a foggy and distant memory
[11:19:03] thanks aqu for taking care of those airflow jobs, I'll update docs and might poke you for some details
[13:43:01] aqu: thanks for looking into the airflow deadlock problem! I reviewed your fix and left a couple comments.
[13:53:42] o/
[13:54:00] ottomata: Welcome back!
[13:55:17] htank youUuUU
[13:55:36] * ottomata starts sharpening email machete
[13:55:42] mforns: https://github.com/fsspec/filesystem_spec/issues/874#issuecomment-1048578621
[14:04:55] heya ottomata :] welcome back! I didn't get if they are updating to use the new pyarrow interface, or if they are replacing pyarrow with fsspec?
[14:05:57] uh, my understanding is they are going to make fsspec use the new pyarrow interface by default, by removing support for the deprecated pyarrow interface
[14:06:09] ok, got it
[14:19:52] (CR) Ottomata: Metrics Platform event schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: Jason Linehan)
[14:25:45] (CR) Ottomata: "Couple of nits, but LGTM otherwise!" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/765671 (https://phabricator.wikimedia.org/T299239) (owner: Sharvaniharan)
[14:35:59] Analytics, Data-Engineering, Event-Platform, Patch-For-Review, Readers-Web-Backlog (Kanbanana-FY-2021-22): WikipediaPortal Event Platform Migration - https://phabricator.wikimedia.org/T282012 (Ottomata) Looks right to me! We'll also need to merge https://gerrit.wikimedia.org/r/c/schemas/even...
[14:41:15] Analytics, Data-Engineering, Pageviews-API: Track page views by page ID rather than title (handles moved pages) - https://phabricator.wikimedia.org/T159046 (Isaac) @EChetty is there an explanation for why this task was declined? As far as I know, it's still not possible to query the pageviews APIs wi...
[14:50:25] (CR) Milimetric: [V: +2 C: +2] Add wikis to sqoop list [analytics/refinery] - https://gerrit.wikimedia.org/r/765578 (https://phabricator.wikimedia.org/T299548) (owner: Milimetric)
[14:50:39] deploying refinery so we're in time for the monthly sqoop
[14:59:45] mforns: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/765314 duplicates some of the queries and table create statements, and I don't see an explanation on the change, kind of curious why we're going that way
[15:00:16] is this part of the airflow migration? (like we will eventually remove the originals?)
[15:01:35] !log deploying new wikis to sqoop list ahead of sqoop job starting in a few hours
[15:01:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:16:19] milimetric: yes! we're just moving it in 3 steps: 1) duplicate the queries, 2) migrate the jobs, 3) remove the old queries.
[15:17:06] gotcha mforns. In that case, part of step 1 should probably be to update the old queries with something like "NOTE: these are in the process of being moved to Airflow, see {folder X} if you want to make any changes"
[15:17:27] OK, makes sense
[15:17:54] milimetric: do you want us to update those before you deploy?
[15:22:36] mforns: nono, I already started, it's not urgent. I just think it'll get really confusing over the next few months as we move stuff slowly and half the jobs are in one situation, the other half in another, and a problem happens
[15:26:55] milimetric: yes we'll do it
[16:00:40] !log refinery done deploying and syncing, new sqoop list is up
[16:00:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:13:52] Data-Engineering, Airflow: Low Risk Oozie Migration: APIs - https://phabricator.wikimedia.org/T300028 (Snwachukwu) a: Snwachukwu
[16:14:37] Data-Engineering, Airflow: Low Risk Oozie Migration: 4 wikidata metrics jobs - https://phabricator.wikimedia.org/T300021 (Snwachukwu) a: Snwachukwu
[16:24:26] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Patch-For-Review: Define the Kubernetes Deployments for Datahub - https://phabricator.wikimedia.org/T301454 (BTullis) I'm significantly further forward I think, but I'm now at the point where I need guidance in terms of setting up: * ingres...
[16:29:09] aqu1: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/23#note_4107
[16:30:28] (PS43) AGueyte: Basic ipinfo instrument setup [schemas/event/secondary] - https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415)
[16:34:20] (CR) AGueyte: Basic ipinfo instrument setup (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415) (owner: AGueyte)
[16:35:42] Analytics, Data-Engineering, Event-Platform, Metrics-Platform, Browser-Support-Microsoft-Edge: Problem with delay caused by intake-analytics.wikimedia.org - https://phabricator.wikimedia.org/T295427 (EChetty)
[16:35:57] Analytics, Data-Engineering, Event-Platform, Metrics-Platform, Browser-Support-Microsoft-Edge: Problem with delay caused by intake-analytics.wikimedia.org - https://phabricator.wikimedia.org/T295427 (EChetty)
[16:36:36] Data-Engineering: Analytics Platform Future State Planning - https://phabricator.wikimedia.org/T302728 (EChetty)
[16:37:37] ottomata: Thanks for the commands!
[16:55:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[16:55:34] Data-Engineering, Data-Engineering-Kanban, Airflow: Airflow concurrency limits - https://phabricator.wikimedia.org/T300870 (Antoine_Quhen) a: Ottomata→Antoine_Quhen
[17:05:05] aqu1: yeah sorry it is not more straightforward
[17:05:15] i bet we could make the run_dev_instance.sh stuff support all that
[17:05:31] but, it won't work if you try to change users after creating the dev instance
[17:10:36] org.apache.spark.api.python.PythonUtils.isEncryptionEnabled does not exist in the JVM
[17:10:37] D:
[17:15:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[18:26:41] milimetric: I have read the code of the webrequest-checker and I think I understand what troubles you about its results
[18:27:48] the results contain every checked host's results except for the ones having a min(sequence-id) = 0 - this therefore includes hosts having no sequence-id misalignment
[18:29:41] The results say "false_positive = TRUE" when there is no data loss - so the behavior is expected, even if it's cumbersome to read - we could change the check from false_positive to real_data_loss to make it more explicit (and change the value as well, obviously :)
[18:31:38] joal: makes sense, but couldn't we exclude the min(sequence-id) <> 0 results if there's nothing else wrong with them?
[18:31:52] we know based on records previous hour and records next hour?
[18:32:42] It's usually useful to have them in there, to compare in case there's some data loss - if it's preferred to remove them, we can do it :)
[18:41:59] Analytics, Data-Engineering, Event-Platform, Patch-For-Review, Readers-Web-Backlog (Kanbanana-FY-2021-22): WikipediaPortal Event Platform Migration - https://phabricator.wikimedia.org/T282012 (LGoto)
[18:59:24] (CR) Phuedx: [C: -1] Metrics Platform event schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: Jason Linehan)
[19:02:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[19:12:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[19:22:07] Data-Engineering, Data-Engineering-Kanban: Some varnishkafka instances dropped traffic for a long time due to the wrong version of the package installed - https://phabricator.wikimedia.org/T300164 (Mayakp.wiki) >>! In T300164#7714313, @JAllemandou wrote: > I have started a job extracting daily pageviews...
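The per-host sequence-number check joal and milimetric discuss above can be sketched roughly as follows. This is a simplified illustration, not the actual webrequest-checker HQL: the function name, the tuple layout, and the `restarted` flag are invented for the example. The idea is that a gap between the span of sequence ids and the count of distinct ids indicates lost messages, while `min(sequence_id) == 0` usually means the varnishkafka instance restarted and its counter was reset.

```python
def host_loss_stats(sequence_ids):
    """Summarize one host's varnishkafka sequence ids for an hour of data.

    Hypothetical helper mirroring the webrequest-checker logic:
    - 'lost' is the shortfall between expected and observed ids;
    - 'restarted' flags a reset counter (excluded from the checker output);
    - 'false_positive' means the row shows no real data loss.
    """
    lo, hi = min(sequence_ids), max(sequence_ids)
    expected = hi - lo + 1           # ids we should have seen in this span
    actual = len(set(sequence_ids))  # ids we actually saw
    return {
        "expected": expected,
        "actual": actual,
        "lost": expected - actual,
        "restarted": lo == 0,
        "false_positive": expected == actual,
    }

# A host missing one id (13) in the middle of its span:
print(host_loss_stats([10, 11, 12, 14, 15]))
```

Renaming `false_positive` to something like `real_data_loss` (with the inverted value), as suggested above, would make rows like these easier to read at a glance.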
[19:26:50] Gone for tonight - see you tomorrow folks
[19:28:13] (CR) Ottomata: Metrics Platform event schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: Jason Linehan)
[21:01:40] Data-Engineering-Kanban, Data-Catalog, Patch-For-Review: Set up opensearch cluster for datahub - https://phabricator.wikimedia.org/T301382 (razzi) VMs have been created and are all running opensearch 1.2.4! Side note: puppet needed to be run twice for some reason. As I understand puppet should a...
[21:02:26] Data-Engineering-Kanban, Data-Catalog, Patch-For-Review: Set up opensearch cluster for datahub - https://phabricator.wikimedia.org/T301382 (razzi)
[22:14:35] (PS3) Sharvaniharan: Add a required field in mobile_apps fragment Added a new required field in a new fragment that will be used only by the android app. We will maintain any future android related fields here. [schemas/event/secondary] - https://gerrit.wikimedia.org/r/765671 (https://phabricator.wikimedia.org/T299239)
[22:15:07] (CR) jerkins-bot: [V: -1] Add a required field in mobile_apps fragment Added a new required field in a new fragment that will be used only by the android app. We will maintain any future android related fields here. [schemas/event/secondary] - https://gerrit.wikimedia.org/r/765671 (https://phabricator.wikimedia.org/T299239) (owner: Sharvaniharan)
[22:25:33] (CR) Sharvaniharan: "@Mikhail I have also renamed the folder to `wikipedia_android_app` from `android_app`, because of the conversation @Ottomata and I had off" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/765671 (https://phabricator.wikimedia.org/T299239) (owner: Sharvaniharan)
[22:31:20] (CR) Ottomata: Add a required field in mobile_apps fragment Added a new required field in a new fragment that will be used only by the android app. We will (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/765671 (https://phabricator.wikimedia.org/T299239) (owner: Sharvaniharan)
[22:34:10] (CR) Sharvaniharan: Add a required field in mobile_apps fragment Added a new required field in a new fragment that will be used only by the android app. We will (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/765671 (https://phabricator.wikimedia.org/T299239) (owner: Sharvaniharan)
[22:39:12] (PS2) Krinkle: build: Add brief "Getting started" guide [schemas/event/secondary] - https://gerrit.wikimedia.org/r/714874
[22:39:15] (PS5) Krinkle: build: Document simpler alternative contribution flow [schemas/event/secondary] - https://gerrit.wikimedia.org/r/714875 (https://phabricator.wikimedia.org/T290074)
[22:41:49] Analytics, Data-Engineering, Event-Platform, Patch-For-Review: Users should run explicit commands to materialize schema versions, rather than using magic git hooks - https://phabricator.wikimedia.org/T290074 (Krinkle) >>! In T290074#7683286, @Ottomata wrote: > The change isn't hard, but the docum...
[22:55:13] (CR) Ottomata: build: Document simpler alternative contribution flow (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/714875 (https://phabricator.wikimedia.org/T290074) (owner: Krinkle)
[22:57:09] Analytics, Data-Engineering, Event-Platform, Patch-For-Review: Users should run explicit commands to materialize schema versions, rather than using magic git hooks - https://phabricator.wikimedia.org/T290074 (Ottomata) Ah right! And, I guess this won't remove any previously added .git hooks in p...
[23:23:56] Analytics, Metrics-Platform, Performance-Team, Product-Analytics, MW-1.35-notes (1.35.0-wmf.21; 2020-02-25): Switch mw.user.sessionId back to session-cookie persistence - https://phabricator.wikimedia.org/T223931 (nshahquinn-wmf) It would be helpful to close this task and open a separate one...