[10:00:28] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog, 10Release Pipeline: Upload required datahub dependencies to Archiva - https://phabricator.wikimedia.org/T301886 (10BTullis) Downloaded the jmx_prometheus_agent files. ` btullis@marlin-wsl-wsl:~/tmp$ wget -q https://repo1.maven.org/maven2/io/pro...
[10:14:13] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog, 10Release Pipeline: Upload required datahub dependencies to Archiva - https://phabricator.wikimedia.org/T301886 (10BTullis) Downloaded the opentelemetry-javaagent files. ` btullis@marlin-wsl-wsl:~/tmp$ wget -q https://repo1.maven.org/maven2/io/...
[10:19:20] (03PS8) 10Btullis: Add configuration for deployment pipeline [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/762950 (https://phabricator.wikimedia.org/T301453)
[10:23:22] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog, 10Release Pipeline: Upload required datahub dependencies to Archiva - https://phabricator.wikimedia.org/T301886 (10BTullis) I'm not going to store the `dockerize` component in archiva, but I'll validate that against its sha1 hash instead.
[10:24:09] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog, 10Release Pipeline: Upload required datahub dependencies to Archiva - https://phabricator.wikimedia.org/T301886 (10BTullis) 05Open→03Resolved
[10:24:11] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog, 10Release Pipeline, 10Patch-For-Review: Create DataHub containers with deployment pipeline - https://phabricator.wikimedia.org/T301453 (10BTullis)
[10:26:20] (03CR) 10Btullis: Add configuration for deployment pipeline (031 comment) [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/762950 (https://phabricator.wikimedia.org/T301453) (owner: 10Btullis)
[10:48:28] I merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/761884 so it's no longer possible to pool the old AQS servers.
[10:52:52] nice :)
[11:01:09] 10Data-Engineering, 10Data-Engineering-Kanban: Some varnishkafka instances dropped traffic for a long time due to the wrong version of the package installed - https://phabricator.wikimedia.org/T300164 (10elukey)
[11:04:02] 10Data-Engineering, 10Data-Engineering-Kanban: Some varnishkafka instances dropped traffic for a long time due to the wrong version of the package installed - https://phabricator.wikimedia.org/T300164 (10elukey) I updated the task's description with the current varnishkafka behavior, if nobody opposes I'd like...
[11:07:53] (03CR) 10Nikerabbit: "Could someone review and merge this patch? It's been waiting for two and a half months." [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/742234 (owner: 10Amire80)
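The sha1 validation BTullis mentions at 10:23 for the `dockerize` component could look roughly like the sketch below. This is a minimal illustration, not the actual procedure used: the filename and the expected hash are placeholders, and in practice the known-good hash would come from the upstream release page.

```python
"""Minimal sketch: verify a downloaded artifact against a known sha1 hash.

The filename and expected hash below are placeholders, not the real
dockerize tarball or its published checksum.
"""
import hashlib
import sys

ARTIFACT = "dockerize-linux-amd64.tar.gz"                    # placeholder filename
EXPECTED_SHA1 = "0123456789abcdef0123456789abcdef01234567"   # placeholder hash


def sha1_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large artifacts don't need to fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    actual = sha1_of(ARTIFACT)
    if actual == EXPECTED_SHA1:
        print(f"OK: {ARTIFACT} matches the expected sha1")
    else:
        print(f"MISMATCH: {ARTIFACT} has sha1 {actual}, expected {EXPECTED_SHA1}")
        sys.exit(1)
```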
[11:23:35] (03PS9) 10Btullis: Add configuration for deployment pipeline [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/762950 (https://phabricator.wikimedia.org/T301453)
[11:39:22] (03CR) 10Btullis: [C: 03+2] Remove double space from two messages [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/742234 (owner: 10Amire80)
[11:40:49] (03Merged) 10jenkins-bot: Remove double space from two messages [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/742234 (owner: 10Amire80)
[11:45:28] (03CR) 10Btullis: Remove double space from two messages (031 comment) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/742234 (owner: 10Amire80)
[13:08:07] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog, 10Patch-For-Review: Define the Kubernetes Deployments for Datahub - https://phabricator.wikimedia.org/T301454 (10BTullis) The official helm charts for datahub use a number of subcharts: https://github.com/acryldata/datahub-helm/tree/master/char...
[13:35:16] (03CR) 10Hashar: "recheck after https://gerrit.wikimedia.org/r/c/integration/config/+/763207" [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/762950 (https://phabricator.wikimedia.org/T301453) (owner: 10Btullis)
[13:36:31] (03CR) 10jerkins-bot: [V: 04-1] Add configuration for deployment pipeline [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/762950 (https://phabricator.wikimedia.org/T301453) (owner: 10Btullis)
[13:38:14] (03CR) 10Btullis: "The corresponding change to the integration/config repository is here:" [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/762950 (https://phabricator.wikimedia.org/T301453) (owner: 10Btullis)
[13:51:22] <_joe_> joal: hi!
[14:00:02] Hi _joe_ :)
[14:09:11] heya joal :]
[14:09:51] Hi mforns :)
[14:11:03] I just saw your message!
[14:15:13] (03PS5) 10Aqu: Migrate AQS/hourly [analytics/refinery] - 10https://gerrit.wikimedia.org/r/756601 (https://phabricator.wikimedia.org/T299398)
[14:15:40] do you know if we're using log4j with spark? I recall we had issues with it, so I'm not sure what the state is. If we use it, we could specify some log-level options to reduce the size of the logs when calling from skein.
[14:24:29] PROBLEM - cache_text: Varnishkafka webrequest Delivery Errors per second -drmrs- on alert1001 is CRITICAL: 5.65 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=drmrs+prometheus/ops&var-source=webrequest&var-cp_cluster=cache_text&var-instance=All
[14:27:15] RECOVERY - cache_text: Varnishkafka webrequest Delivery Errors per second -drmrs- on alert1001 is OK: (C)5 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=drmrs+prometheus/ops&var-source=webrequest&var-cp_cluster=cache_text&var-instance=All
[14:35:29] ah lovely :)
[14:35:39] this is Marseille --^ (not serving traffic)
[14:42:17] 10Data-Engineering: Upgrade Turnilo - https://phabricator.wikimedia.org/T301990 (10Milimetric)
[14:43:48] I noticed, elukey. Btw, I was thinking, it would be nice to ingest all the server metadata so we can use it to run quality checks. Like "these servers are supposed to be publishing webrequest to kafka" and we "select distinct hostname from webrequest" to make sure everything checks out
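Milimetric's quality-check idea at 14:43 could be prototyped roughly like this with Spark SQL. It is only a sketch: the query assumes the usual `wmf.webrequest` partition layout, and the `expected_hosts` set is a placeholder, since the ingested server-metadata table it would come from does not exist yet.

```python
"""Rough sketch of the 14:43 idea: compare the hosts that are *supposed* to
publish webrequest to Kafka with the hosts actually seen in webrequest.
`expected_hosts` is a placeholder; in practice it would come from ingested
server metadata, which is exactly the part that does not exist yet.
"""
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("webrequest-host-check").getOrCreate()

# Placeholder list; would be loaded from ingested server metadata.
expected_hosts = {"cp3050.esams.wmnet", "cp4021.ulsfo.wmnet"}

# Hosts actually observed in one hour of webrequest (assumed partition layout).
observed = {
    row.hostname
    for row in spark.sql("""
        SELECT DISTINCT hostname
        FROM wmf.webrequest
        WHERE webrequest_source = 'text'
          AND year = 2022 AND month = 2 AND day = 21 AND hour = 14
    """).collect()
}

missing = expected_hosts - observed   # expected but silent -> possible vk problem
unknown = observed - expected_hosts   # seen but not expected -> stale metadata?

print("missing:", sorted(missing))
print("unknown:", sorted(unknown))
```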
Like "these servers are supposed to be publishing webrequest to kafka" and we "select distinct hostname from webrequest" to make sure everything checks out [14:46:58] milimetric: it would be a little tricky in my opinion, one easy way is to just add a monitor for each vk instances that checks if any datapoint has been sent in the past X minutes [14:47:29] the new nodes should be either downtimed or not firing alerts when being prepped (the above alert is likely a mistake) [14:47:50] and once a node is in production and serving traffic, it will surely send some data [14:47:55] if not, we get an alert [14:48:43] makes sense... I feel like to do that we'd have to have known about the problem ahead of time. I'm trying to figure a way to discover problems that we don't know about, based on what we think the puppet configuration should be doing [14:50:05] yeah I see, I am suggesting a simpler approach because getting what's in puppet may not be straightforward, and there could be corner cases with prod's status [15:04:07] (03PS10) 10Btullis: Add configuration for deployment pipeline [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/762950 (https://phabricator.wikimedia.org/T301453) [15:11:44] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog, 10Release Pipeline, 10Patch-For-Review: Create DataHub containers with deployment pipeline - https://phabricator.wikimedia.org/T301453 (10BTullis) The first container build is happening now: https://integration.wikimedia.org/ci/job/datahub-pi... [15:12:48] (03CR) 10Btullis: "I see. I misunderstood the syntax of .pipeline/config.yaml" [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/762950 (https://phabricator.wikimedia.org/T301453) (owner: 10Btullis) [15:36:14] (03CR) 10jerkins-bot: [V: 04-1] Add configuration for deployment pipeline [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/762950 (https://phabricator.wikimedia.org/T301453) (owner: 10Btullis) [15:56:12] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog, 10Patch-For-Review: Define the Kubernetes Deployments for Datahub - https://phabricator.wikimedia.org/T301454 (10akosiaris) Hi, That could be a valid way forward, however there are others. Let me point out some pros and cons with this approach... [15:58:31] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog, 10Patch-For-Review: Define the Kubernetes Deployments for Datahub - https://phabricator.wikimedia.org/T301454 (10akosiaris) >>! In T301454#7718086, @BTullis wrote: > The official helm charts for datahub use a number of subcharts: https://github... [16:16:30] (03PS1) 10Mforns: Release 2.9.3 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/763546 [16:17:41] (03CR) 10Mforns: [V: 03+2 C: 03+2] "Deploying!" 
[16:18:05] !log deployed wikistats2
[16:18:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:21:54] (03PS1) 10Bearloga: movement_metrics: Remove error notebook and improve docs [analytics/wmf-product/jobs] - 10https://gerrit.wikimedia.org/r/763552 (https://phabricator.wikimedia.org/T295733)
[16:22:39] (03CR) 10Bearloga: [V: 03+2 C: 03+2] movement_metrics: Remove error notebook and improve docs [analytics/wmf-product/jobs] - 10https://gerrit.wikimedia.org/r/763552 (https://phabricator.wikimedia.org/T295733) (owner: 10Bearloga)
[16:24:08] 10Data-Engineering, 10Patch-For-Review, 10Product-Analytics (Kanban): Test log file and error notification - https://phabricator.wikimedia.org/T295733 (10mpopov) @Mayakp.wiki I removed the notebook and used the opportunity to make some improvements to the documentation. I think this task is OK to resolve now?
[16:26:18] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog: Configure MariaDB database for DataHub on an-coord1001 - https://phabricator.wikimedia.org/T301459 (10Milimetric)
[16:34:39] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog: Evaluate DataHub as a Data Catalog - https://phabricator.wikimedia.org/T299703 (10Milimetric)
[16:36:46] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog, 10Epic: Evaluate OpenMetadata as a Data Catalog - https://phabricator.wikimedia.org/T300540 (10Milimetric)
[16:36:48] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog, 10Epic: Evaluate OpenMetadata as a Data Catalog - https://phabricator.wikimedia.org/T300540 (10Milimetric)
[16:57:46] (03PS11) 10Btullis: Add configuration for deployment pipeline [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/762950 (https://phabricator.wikimedia.org/T301453)
[17:03:29] mforns: standup
[17:03:49] oh and razzi
[17:10:13] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog, 10Epic: Evaluate Atlas as a Data Catalog - https://phabricator.wikimedia.org/T299166 (10Milimetric)
[17:32:47] (03CR) 10jerkins-bot: [V: 04-1] Add configuration for deployment pipeline [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/762950 (https://phabricator.wikimedia.org/T301453) (owner: 10Btullis)
[17:49:58] (03CR) 10Nikerabbit: Remove double space from two messages (031 comment) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/742234 (owner: 10Amire80)
[17:54:19] 10Data-Engineering-Kanban, 10Data-Catalog, 10Patch-For-Review: Set up opensearch cluster for datahub - https://phabricator.wikimedia.org/T301382 (10ops-monitoring-bot) Icinga downtime set by razzi@cumin1001 for 7 days, 0:00:00 1 host(s) and their services with reason: Node is being set up for first time and...
[17:58:56] 10Data-Engineering-Kanban, 10Data-Catalog, 10Patch-For-Review: Set up opensearch cluster for datahub - https://phabricator.wikimedia.org/T301382 (10razzi) I ran puppet and got an error installing the opensearch package. I ran puppet twice to see if it was missing an implicit dependency, but it failed on the...
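One quick way to narrow down the kind of failure razzi describes at 17:58 is to ask apt directly whether the package has an installable candidate on that host at all. A minimal sketch, wrapping `apt-cache policy` from Python; the package name comes from the task, everything else is generic and only works on a Debian/Ubuntu host.

```python
"""Minimal sketch: check whether a package has an installable candidate on
this host, which is the question behind the opensearch install failure above.
Wraps `apt-cache policy`, so it must run on a Debian/Ubuntu host.
"""
import subprocess
from typing import Optional


def candidate_version(package: str) -> Optional[str]:
    """Return the version apt would install, or None if no candidate exists."""
    out = subprocess.run(
        ["apt-cache", "policy", package],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        line = line.strip()
        if line.startswith("Candidate:"):
            candidate = line.split(":", 1)[1].strip()
            return None if candidate == "(none)" else candidate
    return None


if __name__ == "__main__":
    version = candidate_version("opensearch")
    print("opensearch candidate:", version or "not available for this distribution")
```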
[18:00:42] (03PS12) 10Btullis: Add configuration for deployment pipeline [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/762950 (https://phabricator.wikimedia.org/T301453)
[18:01:28] 10Data-Engineering-Kanban, 10Data-Catalog, 10Patch-For-Review: Set up opensearch cluster for datahub - https://phabricator.wikimedia.org/T301382 (10elukey) ` elukey@apt1001:/srv/wikimedia$ sudo reprepro lsbycomponent opensearch opensearch | 1.2.4 | buster-wikimedia | thirdparty/opensearch1 | amd64 ` The dat...
[18:01:35] razzi: I left a comment for opensearch; there are no packages for bullseye-wikimedia
[18:01:58] not sure if observability is working on it, but at the moment it is not supported
[18:02:29] Got it, thanks elukey
[18:03:46] Oh well, better rebuild it as buster then.
[18:05:15] there may be something already in progress; in theory the observability team works the most with those packages, so it's worth asking
[18:10:44] Ok yeah I asked in #wikimedia-observability
[18:11:11] 10Analytics, 10Data-Engineering, 10Event-Platform, 10Patch-For-Review, 10Readers-Web-Backlog (Kanbanana-FY-2021-22): WikipediaPortal Event Platform Migration - https://phabricator.wikimedia.org/T282012 (10Jdrewniak) a:05Jdrewniak→03None
[18:19:59] (03PS13) 10Btullis: Add configuration for deployment pipeline [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/762950 (https://phabricator.wikimedia.org/T301453)
[18:51:18] (03CR) 10jerkins-bot: [V: 04-1] Add configuration for deployment pipeline [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/762950 (https://phabricator.wikimedia.org/T301453) (owner: 10Btullis)
[19:23:03] heya razzi :] we just realized that aqu does not have merging rights in the refinery repo. What should I do to get them for him?
[19:41:09] hi mforns, back from lunch now, let me take a look
[19:41:15] 10Data-Engineering, 10Patch-For-Review, 10Product-Analytics (Kanban): Test log file and error notification - https://phabricator.wikimedia.org/T295733 (10mpopov) >>! In T295733#7697609, @mpopov wrote: >> I was thinking that the default value should be the `$title` of the resource, because that will match the...
[20:29:49] thanks razzi! :]
[20:36:31] Hi so mforns I can see the analytics team members on gerrit here: https://gerrit.wikimedia.org/r/admin/groups/d34747bee94be39cff54b5fda1ae36b575107792,members
[20:37:06] However I must not have administrator permissions myself
[20:37:18] oh, yea, I haven't either...
[20:39:02] thanks anyway razzi, let's bring that up at the next standup!
[21:24:48] PROBLEM - Check unit status of eventlogging_to_druid_network_flows_internal_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_network_flows_internal_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:02:00] 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, 10Product-Analytics, 10Superset: Help with data that's not appearing on charts - https://phabricator.wikimedia.org/T301895 (10Iflorez) This seems like a bug. It doesn't seem like a caching issue nor something related to date ranges. Our regula...
[22:10:18] RECOVERY - Check unit status of eventlogging_to_druid_network_flows_internal_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_network_flows_internal_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:24:08] PROBLEM - Check unit status of eventlogging_to_druid_network_flows_internal_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_network_flows_internal_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:30:51] (03PS1) 10Joal: [WIP] Add flink job reporting webrequest patterns [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/763610
[22:35:14] Hey folks. We assume it's expected because of the spelling, but can you confirm that "22:30:13 fyi i just merged a patch on cookbook to remove datahubsearche1002" is ok?
[22:35:45] the netbox cookbook that is
[22:36:55] a-team
[22:37:41] Hi RhinosF1 - our ops are not up at that time - I don't think you'll get an answer before tomorrow morning
[22:38:32] joal: not running the dns cookbook is a kinda not-so-great thing
[22:38:50] As it leaves a surprise for other people
[22:38:58] luckily this was an easy one to guess
[22:41:22] RhinosF1: I am sorry but I barely understand what you're talking about - I understand something wrong has been done and I am sorry for that, but I can't help
[22:43:24] Gone for tonight, team
[22:51:41] RhinosF1: yeah that's fine
[22:53:22] Ty razzi
[23:09:30] RECOVERY - Check unit status of eventlogging_to_druid_network_flows_internal_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_network_flows_internal_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:23:14] PROBLEM - Check unit status of eventlogging_to_druid_network_flows_internal_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_network_flows_internal_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
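The flapping alert above is the generic "Check unit status" monitor on an-launcher1002; the usual first step when it fires is to look at the unit itself on the host. Below is a minimal sketch of that kind of check, not the actual Icinga plugin: the unit name is taken from the alert, the `.service` suffix is an assumption, and the script has to run on the host (or via ssh/cumin).

```python
"""Minimal sketch of what the flapping check above watches: the state of a
systemd unit on an-launcher1002. Wraps systemctl/journalctl, so it has to be
run on the host itself. The .service suffix is an assumption; the alert only
gives the unit's base name.
"""
import subprocess

UNIT = "eventlogging_to_druid_network_flows_internal_hourly.service"


def unit_failed(unit: str) -> bool:
    """`systemctl is-failed` exits 0 when the unit is in the failed state."""
    return subprocess.run(
        ["systemctl", "is-failed", "--quiet", unit], check=False
    ).returncode == 0


if __name__ == "__main__":
    if unit_failed(UNIT):
        # Show the most recent log lines to see why the timer's service failed.
        subprocess.run(["journalctl", "-u", UNIT, "-n", "50", "--no-pager"])
    else:
        print(f"{UNIT} is not in a failed state")
```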