[00:19:57] 10Data-Engineering-Kanban, 10Data-Catalog, 10Patch-For-Review: Set up opensearch cluster for datahub - https://phabricator.wikimedia.org/T301382 (10razzi) @colewhite could you perhaps weigh in on how to get the ferm rules accepting traffic between the elasticsearch nodes? It looks like opensearch instances...
[01:04:55] 10Data-Engineering-Kanban, 10Data-Catalog, 10Patch-For-Review: Set up opensearch cluster for datahub - https://phabricator.wikimedia.org/T301382 (10colewhite) >>! In T301382#7749398, @razzi wrote: > @colewhite could you perhaps weigh in on how to get the ferm rules accepting traffic between the elasticsearch...
[01:25:10] 10Data-Engineering-Kanban, 10Data-Catalog, 10Patch-For-Review: Set up opensearch cluster for datahub - https://phabricator.wikimedia.org/T301382 (10razzi) Cool, thanks @colewhite. I also threw in the `::profile::opensearch::monitoring::base_checks` since it seems to implement a simple, useful http check.
[01:42:52] 10Data-Engineering-Kanban, 10Data-Catalog, 10Patch-For-Review: Set up opensearch cluster for datahub - https://phabricator.wikimedia.org/T301382 (10ops-monitoring-bot) Icinga downtime set by razzi@cumin1001 for 7 days, 0:00:00 3 host(s) and their services with reason: Still having errors setting up opensearc...
[01:44:40] 10Data-Engineering-Kanban, 10Data-Catalog, 10Patch-For-Review: Set up opensearch cluster for datahub - https://phabricator.wikimedia.org/T301382 (10razzi) The firewall changes appear to be successful: ` razzi@datahubsearch1001:/etc/ferm$ nc -z datahubsearch1002 9300 razzi@datahubsearch1001:/etc/ferm$ echo $...
[02:12:39] 10Data-Engineering-Kanban, 10Data-Catalog, 10Patch-For-Review: Set up opensearch cluster for datahub - https://phabricator.wikimedia.org/T301382 (10razzi) Ok, at least the missing jvm.options is that I did a `sudo systemctl restart opensearch` when I should have used the service `opensearch_1@datahub.service...
[09:45:03] (03PS1) 10Btullis: Revert "Override the location of the pidfile for datahub-frontend" [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/767693
[09:45:50] (03CR) 10Btullis: [C: 03+2] Revert "Override the location of the pidfile for datahub-frontend" [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/767693 (owner: 10Btullis)
[09:55:14] (03PS1) 10Btullis: Change the CWD of the datahub-frontend [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/767725 (https://phabricator.wikimedia.org/T301453)
[10:08:33] (03CR) 10jerkins-bot: [V: 04-1] Change the CWD of the datahub-frontend [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/767725 (https://phabricator.wikimedia.org/T301453) (owner: 10Btullis)
[10:11:40] 10Data-Engineering, 10MediaWiki-extensions-WikimediaEvents, 10Patch-For-Review: Remove SearchSatisfactionErrors EventLoggingSchemas entry - https://phabricator.wikimedia.org/T302895 (10phuedx) >>! In T302895#7748689, @EBernhardson wrote: > The related code was [[ https://gerrit.wikimedia.org/r/c/mediawiki/ex...
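[Editor's note: the `nc -z` paste at 01:44:40 above is how razzi verified that the new ferm rules allow the opensearch transport port (9300) between the datahubsearch nodes. A minimal Python equivalent of that check is sketched below; the host list only contains the two hostnames that appear in the log, and the script is illustrative rather than part of any Puppet change.]

    #!/usr/bin/env python3
    """Rough equivalent of `nc -z <host> 9300`: verify the opensearch
    transport port is reachable from this node. Hostnames are the ones
    mentioned in the log above; adjust to the real cluster membership."""
    import socket
    import sys

    HOSTS = ["datahubsearch1001", "datahubsearch1002"]
    PORT = 9300  # opensearch node-to-node transport port

    def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        unreachable = [h for h in HOSTS if not port_open(h, PORT)]
        for h in unreachable:
            print(f"cannot reach {h}:{PORT}", file=sys.stderr)
        sys.exit(1 if unreachable else 0)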
[10:15:15] (03CR) 10jerkins-bot: [V: 04-1] Revert "Override the location of the pidfile for datahub-frontend" [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/767693 (owner: 10Btullis)
[10:42:20] 10Data-Engineering, 10MediaWiki-extensions-WikimediaEvents, 10Product-Analytics: Remove CompletionSuggestions EventLoggingSchemas entry - https://phabricator.wikimedia.org/T302894 (10phuedx)
[10:55:53] (03PS2) 10Btullis: Change the directory structure of datahub-frontend [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/767725 (https://phabricator.wikimedia.org/T301453)
[11:10:02] (03CR) 10jerkins-bot: [V: 04-1] Change the directory structure of datahub-frontend [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/767725 (https://phabricator.wikimedia.org/T301453) (owner: 10Btullis)
[11:18:21] (03Abandoned) 10Btullis: Change the directory structure of datahub-frontend [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/767725 (https://phabricator.wikimedia.org/T301453) (owner: 10Btullis)
[11:21:07] (03PS1) 10Btullis: Update the path of the user.props file [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/767743 (https://phabricator.wikimedia.org/T301453)
[12:03:22] (03CR) 10Btullis: [V: 03+2 C: 03+2] Revert "Override the location of the pidfile for datahub-frontend" [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/767693 (owner: 10Btullis)
[12:19:51] (03CR) 10Btullis: [C: 03+2] Update the path of the user.props file [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/767743 (https://phabricator.wikimedia.org/T301453) (owner: 10Btullis)
[13:16:51] (03PS1) 10Btullis: Add new containers for the backend setup tasks [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/767782 (https://phabricator.wikimedia.org/T301454)
[13:31:00] (03PS2) 10Btullis: Add new containers for the backend setup tasks [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/767782 (https://phabricator.wikimedia.org/T301454)
[13:43:45] (03CR) 10Btullis: [C: 03+2] Add new containers for the backend setup tasks [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/767782 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis)
[14:06:35] joal: o/
[14:06:43] things looking good, i can iterate!
[14:06:45] but q
[14:07:11] it seems that the job metrics like JobSucceeded etc. are emitted from the launching process
[14:07:16] and the kafka ones from the worker
[14:07:43] so, i'm not sure how we can associate the job succeeded ones with the kafka toppars
[14:07:51] or, am I misunderstanding something?
[14:21:32] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "I would click +2, but can't in this codebase." [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/753470 (https://phabricator.wikimedia.org/T260297) (owner: 10Ladsgroup)
[14:36:18] ottomata: you understand the thing very well :) for the "main" metrics (the ones from the launcher), we can't link them to kafka topics - and IIRC it's not needed
[14:42:35] ok, so just the job label is sufficient?
[14:42:47] joal qq, is there a difference between 'pull' and 'extract' in these metric names
[14:42:54] some are like # records extracted
[14:42:59] vs avgRecordPullTime
[14:43:02] which SRE wants to push the mw history snapshot live? https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS#Deploy_new_History_snapshot_for_Wikistats_Backend
[14:43:05] https://gerrit.wikimedia.org/r/c/operations/puppet/+/767792
[14:43:12] is the 'pull' time the same as the 'extract' time?
[14:43:23] milimetric: i guess it's my ops week!
[14:43:30] ottomata: not yet, standup
[14:43:37] and you're not SRE anymore
[14:43:41] :P
[14:43:46] oh hahahah
[14:43:48] but i can do it!
[14:43:49] but you can do it if you want
[14:43:49] anyway
[14:44:26] milimetric: you can test ya?
[14:44:31] yep
[14:44:34] milimetric: Hi! have you checked the datasource is available in druid?
[14:44:44] i just merged, let me know if i should not
[14:45:05] sorry maybe I messed up
[14:45:13] I remember now you said to be careful a few weeks ago
[14:45:28] not available yet
[14:45:45] we should change that email alerting, and make the deploy automagic
[14:45:57] I don't know how, but we should think about that
[14:46:08] reverting
[14:46:14] ottomata: you don't have to
[14:46:24] it'll be available in a matter of hours normally
[14:46:29] it won't pick it up until you roll restart
[14:46:36] right, but i'd rather do those together
[14:46:41] right, it started 1 hour ago: https://hue.wikimedia.org/hue/jobbrowser/#!id=0067736-220113112502223-oozie-oozi-W
[14:46:56] as you prefer ottomata
[14:46:59] ok, I'll watch the job and ping when it's done. Sorry! I do think we should do this automatically
[14:47:11] (03CR) 10Ladsgroup: "Adding reviewer from previous patches." [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/753470 (https://phabricator.wikimedia.org/T260297) (owner: 10Ladsgroup)
[14:47:15] (03CR) 10Ladsgroup: "recheck" [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/753470 (https://phabricator.wikimedia.org/T260297) (owner: 10Ladsgroup)
[14:47:46] joal: pull vs extract?
[14:47:55] ottomata: thinking about it
[14:47:58] ha k :)
[14:48:45] ottomata: I can't see a difference - possibly they happen at different points in the flow, and therefore pull would be before deserialization, and extract after, but those are pure guesses
[14:50:42] just wondering if i should make those metric names consistent
[14:51:02] other question ottomata: do we need both?
[14:51:38] joal maybe answer in https://gobblin.readthedocs.io/en/latest/case-studies/Kafka-HDFS-Ingestion/ ?
[14:51:41] Source and Extractor
[14:51:49] ottomata: in my list of stuff to use, I listed the extractor metrics
[14:52:00] you have one pull metric too
[14:52:05] gobblin_kafka_partition_average_record_pull_time
[14:52:30] yeah that one is pure-kafka
[14:52:49] ok yeah
[14:52:53] it does seem they are different
[14:53:05] so pull is time from kafka
[14:53:09] and extract is everything in gobblin?
[14:53:15] i'll keep the different names then
[14:53:15] ty
[14:53:29] np ottomata - sorry for not being very precise :S
[14:53:42] no you were!
[14:53:43] just good to know.
[14:54:26] (03PS1) 10Btullis: Added a README-WMF file [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/767794
[14:54:35] And, you'll find the kafka-topic-partition data on those kafka-oriented events/metrics - while the gobblin ones don't have them
[14:54:39] ottomata: --^
[14:56:09] yes
[14:56:18] joal what was the need for the map then?
[14:56:44] i thought the reason was to collect everything and then use the topic partition info from the kafka metrics to also add topic partition labels to the non kafka metrics
[14:56:46] ottomata: cause we need the topic-partition as labels for prometheus, nothing else
[14:56:59] right...but i think
[14:57:23] we can just create the gauge metrics and set the values on them with the labels when reportEventQueue is called
[14:57:26] and then on close
[14:57:37] finally just call pushGateway.
[14:57:43] to have all the metrics pushed then
[14:57:54] since the metrics are stored in the CollectorRegistry anyway
[14:58:32] ottomata: whether you do the job upfront and use the CollectorRegistry as a storage, or do it after, is a matter of choice :)
[14:58:50] mk :)
[14:59:22] ottomata: the thing is, we're not sure that there will be one task per metric-reporter - there could be more than one
[14:59:39] And in that case, we need to keep the task_id associated with the gauge
[15:00:14] do we need the task_id? All we need is the topic and partition, right?
[15:01:48] ottomata: yes, but in order to reassign after having received events, you need a way to differentiate between tasks
[15:02:06] if there is more than one task handled by the same reporter
[15:02:43] Gone for kids - back at standup :)
[15:02:52] hmm, oh ok i don't understand, lets bc later
[15:02:57] ack ottomata
[15:03:04] heya aqu :] I left some comments on your patch again. If you have time we can discuss in the Airflow troubleshooting meeting.
[15:03:22] Wow, actually I have a meeting at standup time - will miss it
[15:05:18] (03CR) 10Mforns: [V: 03+2 C: 03+2] "Thanks!" [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/753470 (https://phabricator.wikimedia.org/T260297) (owner: 10Ladsgroup)
[15:06:44] (03PS2) 10Btullis: Added a README-WMF file [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/767794
[16:04:58] (03CR) 10Btullis: [C: 03+2] Added a README-WMF file [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/767794 (owner: 10Btullis)
[16:05:03] 10Data-Engineering-Kanban, 10Data-Catalog, 10Patch-For-Review: Set up opensearch cluster for datahub - https://phabricator.wikimedia.org/T301382 (10colewhite) >>! In T301382#7749508, @razzi wrote: > Ok, at least the missing jvm.options is that I did a `sudo systemctl restart opensearch` when I should have us...
[16:12:10] 10Data-Engineering-Kanban, 10Data-Catalog, 10Patch-For-Review: Set up opensearch cluster for datahub - https://phabricator.wikimedia.org/T301382 (10elukey) Hi Cole! >>! In T301382#7750936, @colewhite wrote: >>>! In T301382#7749508, @razzi wrote: >> Ok, at least the missing jvm.options is that I did a `sudo...
[16:23:46] ah, joal I suppose gobblin_task_duration is the one we would like to submit with kafka topic and partition labels, yes?
[16:44:11] (03PS1) 10Btullis: Fix the entrypoint [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/767837 (https://phabricator.wikimedia.org/T301453)
[16:46:29] (03CR) 10Btullis: [V: 03+2 C: 03+2] Fix the entrypoint [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/767837 (https://phabricator.wikimedia.org/T301453) (owner: 10Btullis)
[17:01:18] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for mediawiki_history_reduced_2022_02 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
[17:13:22] ottomata: mw reduced finished, so the segment should be available in druid, which makes that last alert strange
[17:14:05] oh, no, it stabilized I guess as it was loading. Ok, so all set - roll-restart as you like
[17:21:18] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for mediawiki_history_reduced_2022_02 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org
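[Editor's note: the 14:56–15:02 exchange above (create labeled gauges up front, set them when reportEventQueue is called, keep everything in the CollectorRegistry, push once on close) is the standard Prometheus push-gateway pattern. The actual reporter under review ("[WIP] Add prometheus metrics reporter", gerrit 767178) is JVM code; the sketch below only illustrates the shape of that pattern with the Python prometheus_client, and the gateway address and job name in it are placeholders, not values from the patch.]

    """Sketch of the pattern from the 14:56-15:02 discussion: labeled gauges
    kept in a CollectorRegistry, set per (topic, partition) as metric events
    arrive, then pushed in one shot when the reporter closes."""
    from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

    registry = CollectorRegistry()

    # Per the 14:54:35 comment, only the kafka-oriented metrics carry
    # topic/partition data, so those are the gauges that get the labels.
    record_pull_time = Gauge(
        "gobblin_kafka_partition_average_record_pull_time",
        "Average record pull time reported by the Kafka extractor",
        ["topic", "partition"],
        registry=registry,
    )

    def report_event_queue(events):
        """Called as metric events are drained; sets the labeled gauges."""
        for event in events:
            record_pull_time.labels(
                topic=event["topic"], partition=str(event["partition"])
            ).set(event["avg_record_pull_time"])

    def close(gateway="pushgateway.example.org:9091", job="gobblin_kafka_ingestion"):
        """On close, push the whole registry at once. Whether a grouping_key
        is also needed (the later 20:48 question) depends on whether
        concurrent pushes for the same job would overwrite each other."""
        push_to_gateway(gateway, job=job, registry=registry)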
[17:23:03] a-team: new standup time is now
[17:32:02] (03PS1) 10Btullis: Edit the entrypoint and setup scripts [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/767847 (https://phabricator.wikimedia.org/T301453)
[17:43:18] 10Data-Engineering, 10Airflow, 10Product-Analytics: Add Product-Analytics Announcements to the oozie job for notifications - https://phabricator.wikimedia.org/T301281 (10EChetty)
[17:43:56] 10Data-Engineering, 10Airflow, 10Product-Analytics: Add Product-Analytics Announcements to the oozie job for notifications - https://phabricator.wikimedia.org/T301281 (10EChetty) Going to change this to be airflow notifications instead of Oozie.
[17:44:19] milimetric: ready to roll restart lemme know when you can test
[17:44:39] ready ottomata
[17:47:47] ok milimetric canary ready for test
[17:48:28] ottomata: checks out!
[17:48:31] !log roll restart aqs to pick up new MW history snapshot
[17:48:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:48:35] (I just use: curl localhost:7232/analytics.wikimedia.org/v1/edits/aggregate/all-projects/all-editor-types/all-page-types/monthly/2020020100/2022030300 | grep 2022 --color)
[17:48:44] before and after the push
[17:48:56] hm, full roll restart failed
[17:49:05] on aqs1004 so was aborted
[17:49:18] trying again
[17:50:00] hm
[17:50:00] You cannot pool a node where weight is equal to 0
[17:50:45] Can we exclude these old hosts from the cookbook?
[17:51:09] oh is that the issue?
[17:51:13] doesn't look like i can from the command
[17:51:18] do i need to submit a patch to the cookbook?
[17:51:48] or is it the cumin aliases?
[17:51:48] yeah I think that we target aqs: P{O:aqs or O:aqs_next}
[17:51:50] Those hosts are just waiting to be decommissioned at the moment. My fault for not having completed it.
[17:52:32] self.aqs_workers = self.remote.query('A:' + args.cluster)
[17:52:54] (lemme know which nodes to test whenever you're done and I can verify nothing weird happened)
[17:53:06] aqs-canary: P{aqs1004.eqiad.wmnet}
[17:53:16] ottomata: look at line 28 of the cookbook
[17:53:19] that is the issue
[17:53:24] oh?
[17:53:26] OH
[17:53:29] i can do aqs_next ?
[17:53:40] i see the alias
[17:53:42] aqs-next
[17:53:43] okay
[17:53:45] in theory it is not in the choices, but we can add it
[17:53:51] there is no aqs-next canary though?
[17:53:58] but yeah no canary
[17:54:12] so probably good to just change the cumin alias
[17:54:17] btullis: if aqs is gone, should we just rename new to aqs and get rid of next?
[17:54:54] Yes, it's on my todo list, https://phabricator.wikimedia.org/T302278
[17:55:08] can i do it in aliases now?
[17:56:44] I suppose so, but I would have thought it cleaner to delete the old aqs cluster first, then shuffle the names.
[17:57:08] hmm, well i need to at least add an aqs-next-canary entry
[17:57:17] in order to deploy to aqs-next
[17:57:32] btullis: is this okay?
[17:57:33] https://gerrit.wikimedia.org/r/c/operations/puppet/+/767853/1/modules/profile/templates/cumin/aliases.yaml.erb
[17:57:42] it just removes old aqs from the alias
[17:58:01] when you rename the puppet classes you can rename the reference here too?
[17:58:01] or if you prefer i can just keep the aqs-next aliases and add canary
[17:58:55] Nah, that's cool.
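[Editor's note: to make the cookbook discussion above easier to follow (the "line 28 of the cookbook" reference and the cumin alias change), here is a very rough Python sketch of how a Spicerack-style cookbook can take the cluster alias as an argument and resolve canary and full host sets from cumin aliases. The argument choices, alias names and function names are illustrative only; this is not the actual roll-restart cookbook.]

    """Illustrative sketch, not the real cookbook: hosts are resolved from
    cumin aliases, so pointing the cookbook at an 'aqs-next' alias (plus a
    matching canary alias) keeps decommissioned, weight-0 hosts out of the
    roll-restart."""
    import argparse

    def argument_parser():
        parser = argparse.ArgumentParser(description="Roll-restart an AQS cluster")
        # 'aqs-next' added to the choices, per the 17:53:45 suggestion.
        parser.add_argument("cluster", choices=["aqs", "aqs-next"],
                            help="cumin alias of the cluster to restart")
        return parser

    def resolve_targets(remote, args):
        # Equivalent of the quoted line 28: the worker set comes from the alias...
        workers = remote.query("A:" + args.cluster)
        # ...and the canary from its own alias, e.g. 'aqs-next-canary' once added.
        canary = remote.query("A:" + args.cluster + "-canary")
        return canary, workers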
[17:59:03] mkay danke
[18:00:18] and thanks elukey :)
[18:02:29] 10Data-Engineering: Migrate to GeoIP2 - https://phabricator.wikimedia.org/T302989 (10odimitrijevic)
[18:03:24] (03CR) 10Btullis: [C: 03+2] Edit the entrypoint and setup scripts [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/767847 (https://phabricator.wikimedia.org/T301453) (owner: 10Btullis)
[18:11:05] joal: i have 20 mins before meeting if you have some time for me!
[18:21:32] (03PS1) 10Btullis: Make these quotes uniform with the other setup tasks [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/767859
[18:26:43] ottomata: sorry, my meeting at home ran longer than expected :(
[18:26:53] will join the meeting now, and will have time after if you wish
[18:27:59] mmkay!
[18:44:22] (03PS3) 10Ottomata: [WIP] Add prometheus metrics reporter [analytics/gobblin-wmf] - 10https://gerrit.wikimedia.org/r/767178 (owner: 10Joal)
[18:45:29] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add prometheus metrics reporter [analytics/gobblin-wmf] - 10https://gerrit.wikimedia.org/r/767178 (owner: 10Joal)
[18:54:33] 10Data-Engineering: Migrate to GeoIP2 - https://phabricator.wikimedia.org/T302989 (10odimitrijevic)
[19:13:41] 10Data-Engineering, 10Data-Engineering-Kanban, 10SRE, 10observability, and 2 others: Kafka 2.x Upgrade Plan - https://phabricator.wikimedia.org/T302610 (10odimitrijevic)
[19:16:36] 10Data-Engineering, 10Data-Engineering-Kanban, 10SRE, 10observability, and 2 others: Kafka 2.x Upgrade Plan - https://phabricator.wikimedia.org/T302610 (10odimitrijevic) @elukey I updated the task description as I ask for it :)
[19:27:01] 10Data-Engineering: Check home/HDFS leftovers of zpapierski - https://phabricator.wikimedia.org/T302779 (10odimitrijevic)
[19:30:38] joal: in bc
[19:30:43] joining!
[19:32:16] 10Data-Engineering: Check home/HDFS leftovers of zpapierski - https://phabricator.wikimedia.org/T302779 (10odimitrijevic) p:05Triage→03High
[19:32:47] 10Data-Engineering: Check home/HDFS leftovers of ema - https://phabricator.wikimedia.org/T302815 (10odimitrijevic)
[19:33:00] 10Data-Engineering: Check home/HDFS leftovers of ema - https://phabricator.wikimedia.org/T302815 (10odimitrijevic) p:05Triage→03High
[19:35:44] 10Analytics-Radar, 10Anti-Harassment, 10CheckUser, 10Privacy Engineering, and 5 others: Deal with Google Chrome User-Agent deprecation - https://phabricator.wikimedia.org/T242825 (10TheresNoTime)
[19:36:21] 10Data-Engineering: Check home/HDFS leftovers of rhuang-ctr - https://phabricator.wikimedia.org/T302194 (10odimitrijevic) p:05Triage→03High
[19:37:47] 10Data-Engineering, 10Code-Health-Objective, 10Epic, 10Platform Engineering Roadmap, and 2 others: [DISCUSS]: Problem details for HTTP APIs (rfc7807) - https://phabricator.wikimedia.org/T302536 (10odimitrijevic)
[19:56:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org
[20:02:49] (03PS3) 10Krinkle: build: Add brief "Getting started" guide [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/714874
[20:02:51] (03PS6) 10Krinkle: build: Document simpler alternative contribution flow [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/714875 (https://phabricator.wikimedia.org/T290074)
[20:04:58] (03CR) 10Krinkle: build: Document simpler alternative contribution flow (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/714875 (https://phabricator.wikimedia.org/T290074) (owner: 10Krinkle)
[20:11:14] (03PS1) 10Btullis: Trivial change [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/767876
[20:11:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org
[20:11:45] (03CR) 10Btullis: [V: 03+2 C: 03+2] Trivial change [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/767876 (owner: 10Btullis)
[20:33:18] (03CR) 10Ottomata: "Great, lemme just do some communication about this, and we should be good" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/714875 (https://phabricator.wikimedia.org/T290074) (owner: 10Krinkle)
[20:34:29] 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Send some existing Gobblin metrics to prometheus - https://phabricator.wikimedia.org/T294420 (10Ottomata) a:05JAllemandou→03Ottomata
[20:36:47] joal: FYI it wasn't working because i had a hardcoded bootstrap.with.offset
[20:38:34] and joal i get metrics for both topics!
[20:38:39] with labels set properly
[20:38:41] no grouping key needed?
[20:42:55] (03PS1) 10Btullis: Fix paths and python [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/767888
[20:43:42] (03CR) 10Btullis: [V: 03+2 C: 03+2] Fix paths and python [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/767888 (owner: 10Btullis)
[20:48:42] (03PS1) 10Btullis: Add the curl package to the final production es-setup container [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/767890
[20:48:51] ottomata: Great! Let's assume no grouping key is needed for now, and ask Philipo tomorrow :)
[20:49:02] (03CR) 10Btullis: [V: 03+2 C: 03+2] Add the curl package to the final production es-setup container [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/767890 (owner: 10Btullis)
[20:51:16] ok! ok joal, for metrics tho.
[20:51:22] what was the plan to report these by?
[20:51:26] taskId?
[20:51:38] you have them all marked as metrics per task
[21:35:05] 10Data-Engineering, 10MediaWiki-extensions-WikimediaEvents, 10Product-Analytics, 10MW-1.38-notes (1.38.0-wmf.25; 2022-03-07): Remove CompletionSuggestions EventLoggingSchemas entry - https://phabricator.wikimedia.org/T302894 (10phuedx) 05Open→03Resolved a:03phuedx
[21:35:07] 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Event-Platform, and 3 others: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned - https://phabricator.wikimedia.org/T282131 (10phuedx)
[21:35:39] 10Data-Engineering, 10MediaWiki-extensions-WikimediaEvents, 10MW-1.38-notes (1.38.0-wmf.25; 2022-03-07): Remove SearchSatisfactionErrors EventLoggingSchemas entry - https://phabricator.wikimedia.org/T302895 (10phuedx) 05Open→03Resolved a:03phuedx
[21:35:44] 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Event-Platform, and 3 others: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned - https://phabricator.wikimedia.org/T282131 (10phuedx)
[21:42:11] (03PS1) 10Btullis: Update the kafka-topic commands to match our version [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/767898
[21:42:44] (03CR) 10Btullis: [V: 03+2 C: 03+2] Update the kafka-topic commands to match our version [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/767898 (owner: 10Btullis)
[22:04:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org
[22:09:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org
[23:02:21] hiiiii! here with moar quick questions... this time about EventGate... just wondering 1) if the HTTP service consumes events directly from the Web, or if it sits behind Varnish, or something else; 2) if it runs on the same servers as Varnish, or is it on separate boxes; and 3) what sort of process did you go through to verify robustness in terms of performance, internal networking requests, and ability to withstand, or at least not bring down the site, under high load?
[23:02:31] many thanks in advance!!!!! :)
[23:03:21] (ok hmmm maybe those aren't necessarily super quick questions... ;p if anyone has some links about them easily on hand, that'd be more than enuf... thanks again!)
[23:08:35] 10Analytics, 10Analytics-Wikistats, 10Data-Engineering, 10I18n: WikiReportsLocalizations.pm still fetches language names from SVN - https://phabricator.wikimedia.org/T64570 (10Aklapper) Could you please answer the last comment? (Or do you know who could answer it?) Thanks in advance.
[23:26:55] AndyRussG: some of your questions might be answered by https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate. For #2, the service is deployed on Kubernetes and has multiple "clusters" which have different connectivity (some internal, some with public endpoints)
[23:29:50] bd808: hmmm right and no Kubernetes on the Varnish boxes... so I imagine something routes event http requests directly to the eventgate boxes, with no Varnish involved?
[23:30:08] thx!
[23:31:36] external endpoints would cross Varnish, but I would guess that the next layer both internal and external would be pybal and then the k8s cluster
[23:33:35] hmmm interesting
[23:34:11] so was there for example any concern that an issue on the EventGate service could cause the Varnishes that forward those requests to bork?
[23:40:13] I don't think that traffic from Extension:EventBus on a prod wiki would route through a Varnish. Instead I would assume that hits LVS (which is monitored by pybal) pointed at the kubernetes ingress nodes
[23:43:00] * bd808 is neither an SRE nor a dev of EventBus so take all of his guesses with appropriate amounts of salt
[23:44:08] bd808: okok cool gotcha, yeah makes sense... thx so much!!!