[03:34:40] (CR) Sharvaniharan: Add a required field in mobile_apps fragment (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/766897 (owner: Sharvaniharan)
[03:53:15] (EventgateLoggingExternalLatency) firing: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[03:58:15] (EventgateLoggingExternalLatency) resolved: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[05:19:16] !log rerunning monthly edit hourly druid oozie coordinator
[05:19:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[05:19:22] (edit-druid-coord)
[05:19:27] https://hue.wikimedia.org/hue/jobbrowser/#!id=0062946-210701181527401-oozie-oozi-C
[05:23:15] (EventgateLoggingExternalLatency) firing: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[05:28:15] (EventgateLoggingExternalLatency) resolved: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[08:45:20] Quarry, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, User-dcaro, cloud-services-team (Kanban): [quarry] Add 'feedback' link to pre-filled phabricator task - https://phabricator.wikimedia.org/T303028 (dcaro)
[09:37:50] (PS1) Phuedx: analytics/legacy/quicksurveyinitiation: Add editCountBucket property [schemas/event/secondary] - https://gerrit.wikimedia.org/r/768014
[09:41:54] (PS2) Phuedx: analytics/legacy/quicksurveyinitiation: Add editCountBucket property [schemas/event/secondary] - https://gerrit.wikimedia.org/r/768014
[09:54:35] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:17:49] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:16:22] Data-Engineering, Event-Platform, SRE, Traffic, and 2 others: Banner sampling leading to a relatively wide site outage (mostly esams) - https://phabricator.wikimedia.org/T303036 (jcrespo)
[11:24:05] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Patch-For-Review: Define the Kubernetes Deployments for Datahub - https://phabricator.wikimedia.org/T301454 (BTullis) I believe that my deployment-charts CR is now at a stage where it should be merged, so that I can begin working with it on...
[11:26:14] Hi, someone wants to create https://meta.wikimedia.org/wiki/Config:MetaSync but it seems it's not allowed. Should it be https://meta.wikimedia.org/wiki/Config:Dashiki:MetaSync?
[11:26:24] I honestly don't know how this works at all
[11:27:38] Amir1: I'm afraid that's completely new to me too.
[11:31:12] it's about https://phabricator.wikimedia.org/source/tool-cr-grants-team-metasync/browse/master/
[11:35:23] I have seen mforns create similar pages before (https://meta.wikimedia.org/w/index.php?title=Config:EEEnwikiMetrics&action=history)
[11:36:09] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Patch-For-Review: Define the Helm charts and helmfile deployments for Datahub - https://phabricator.wikimedia.org/T301454 (BTullis)
[13:01:14] Data-Engineering, Event-Platform, serviceops, Sustainability (Incident Followup): eventgate-* tls telemetry is disabled - https://phabricator.wikimedia.org/T303042 (JMeybohm)
[13:46:43] btullis: nice stuff with the helm charts, let me know if you want me to look through it.
[13:48:03] Thanks. If you have time, you could see if my instructions to spin up a local dev environment work for you.
[13:48:38] (CR) Ottomata: [C: +1] Add a required field in mobile_apps fragment (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/766897 (owner: Sharvaniharan)
[13:52:23] btullis: ok i'll try to do that shortly
[13:53:39] Cool, thanks. If you have minikube handy it should just be a case of:
[13:53:57] https://www.irccloud.com/pastebin/owhifJLQ/
[13:54:05] i haven't run minikube in 2 years so i'm sure i'm going to have to install some stuff :p
[13:54:16] AndyRussG: happy to talk if you are around
[13:54:42] AndyRussG: https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate/Administration#Benchmarking
[13:55:10] external requests to EventGate do indeed come via varnish, as a simple proxy pass through (there is no caching of course)
[13:55:36] There is no concern that EventGate could cause varnish to bork
[13:56:02] huge #s of requests could cause varnish (or any part of that pipe) to bork
[13:56:07] but that is true of any http service
[13:56:30] varnish is only involved because it is the front end gateway
[14:00:30] > huge #s of requests could cause varnish (or any part of that pipe) to bork
[14:00:30] If I'm not mistaken, that's exactly what we saw this morning, overloading eventgate-analytics-external with an accidentally high sampling rate (100%) on a new banner that was rolled out.
[14:02:42] btullis: +1 exactly
[14:03:07] varnish called ATS, ATS called eventgate that in turn caused connections to pile up
[14:03:46] ottomata: --^
[14:03:53] (PS3) Phuedx: analytics/legacy/quicksurveyinitiation: Add editCountBucket property [schemas/event/secondary] - https://gerrit.wikimedia.org/r/768014
[14:04:14] yeah was going to ask for context about that, i just saw elukey's patch come in
[14:04:51] (CR) Ottomata: [C: +1] analytics/legacy/quicksurveyinitiation: Add editCountBucket property [schemas/event/secondary] - https://gerrit.wikimedia.org/r/768014 (owner: Phuedx)
[14:09:37] elukey: ty, just read incident report
[14:09:41] thanks for the response
[14:10:05] ottomata: I was about to ping you in #security, super :)
[14:10:33] elukey were there changes recently that could have caused this? Running sampling at 100% may be rare but it has been done in the past on large campaigns
[14:11:04] (CR) Gehel: "See minor comments inline." (5 comments) [analytics/gobblin-wmf] - https://gerrit.wikimedia.org/r/767178 (owner: Joal)
[14:15:34] (answered in pvt)
[14:29:41] (CR) Ottomata: "Thanks Gehel! Still super WIP, I pushed code example for Joal to see. Will respond to your comments once we get it all working." [analytics/gobblin-wmf] - https://gerrit.wikimedia.org/r/767178 (owner: Joal)
[14:37:46] btullis: what k8s version are you using in minikube?
[14:40:07] Hmm. minikube itself is 1.25.1. Checking what k8s version I have,
[14:45:56] btullis: i dunno my minikube / k8s is having issues
[14:45:57] Error: apiVersion 'v2' is not valid. The value must be "v1"
[14:46:05] i think my k8s is too old
[14:46:15] hi Amir1! I think it should be just .../wiki/Dashiki:MetaSync
[14:46:15] but all my attempts at upgrading are failing
[14:46:27] Amir1: that's what I understand from https://wikitech.wikimedia.org/wiki/Analytics/Systems/Dashiki
[14:46:41] still trying a few things
[14:47:12] mforns: okay, I'm going to tell them this, fingers crossed
[14:47:14] Thanks
[14:50:59] ottomata: I think it's `Kubernetes v1.23.1`
[15:00:34] Data-Engineering, Data-Catalog, SRE, Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (BTullis) I have created a patch to operations/deployment-charts that I believe will be a good start in enabling this service. https://gerrit.wikimedia.org/r/...
[15:00:57] Data-Engineering, Data-Catalog, SRE, Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (BTullis)
[15:01:03] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Patch-For-Review: Define the Helm charts and helmfile deployments for Datahub - https://phabricator.wikimedia.org/T301454 (BTullis)
[15:08:40] btullis: ok i think i got a new version of helm that is working...
[15:10:30] Great. I'm using minikube dashboard and a successful deploy shows me this:
[15:10:34] https://usercontent.irccloud-cdn.com/file/uIsujGLO/image.png
[15:10:55] Data-Engineering, Data-Catalog, SRE, serviceops, Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (JMeybohm)
[15:11:29] Data-Engineering, Data-Catalog, SRE, serviceops, Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (BTullis) How can I tell what the source IP address(es) of my services will be, as seen by the back-end data stores? Will these be predicatabl...
[15:11:56] btullis: looks like prerequisites mysql failed
[15:11:57] Error: secret "mysql-secrets" not found
[15:12:15] Ah, forgot to add that to the instructions. Hang on...
[15:12:50] `kubectl create secret generic mysql-secrets --from-literal=mysql-root-password=datahub`
[15:18:36] Data-Engineering, Data-Catalog, SRE, serviceops, Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (BTullis) The diagram doesn't cover prometheus support, but it is included. I have added: `prometheus.io/port: 4318` and `prometheus.io/scrap...
[15:20:26] btullis: o/ what I have seen in ferm rules has always been stuff like "SERVICES_KUBEPODS_NETWORKS"
[15:20:41] (or say STAGING_KUBEPODS_NETWORKS etc..)
[15:21:33] (everything is defined in network::constants)
[15:21:42] elukey: Oh nice. That seems fine to me, as long as we don't need it to be any more granular than that.
[15:21:54] Analytics, Data-Engineering, Data-Engineering-Kanban, Event-Platform, Patch-For-Review: Users should run explicit commands to materialize schema versions, rather than using magic git hooks - https://phabricator.wikimedia.org/T290074 (Krinkle)
[15:22:47] btullis: yeah definitely, but predicting the pod ips may be very hard
[15:23:10] (the mlserve cluster has a similar rule for its ip addresses)
[15:23:29] heya aqu :] do you mind if I break the rules and use the test cluster airflow instance to test a job now :P, are you using it by chance?
[15:24:19] mforns: I am currently using dev instance. No problem!
[15:24:28] ok! thanks
[15:34:41] btullis: sigh i'm going to work on some other things, minikube or k8s or helm now is busted, and won't run
[15:34:54] i probably should just reinstall everything and understand it again
[15:35:09] OK, not to worry. Sorry if I've wasted your time with it.
[15:35:14] naw its ok
[15:35:20] the prereq thing looks really nice
[15:35:22] i like how you did that
[15:35:37] :-)
[15:37:12] Thatsnk. The three setup batch jobs were what took me all of yesterday to get working. These populate the mysql database, create the kafka topics and create the ES indices. Without these the local dev environment won't work.
[15:37:29] s/Thatsnk/Thanks
[15:41:45] e.g. the Kafka-setup job is this: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/764375/18/charts/datahub/templates/kafka-setup-job.yml
[15:41:45] ...which calls a helm pre-install hook to run a batch job with this container: https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/datahub/+/refs/heads/wmf/.pipeline/kafka-setup/blubber.yaml
[15:41:45] ...to run this script: https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/datahub/+/refs/heads/wmf/docker/kafka-setup/kafka-setup.sh
[15:42:58] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Define LVS load-balancing for OpenSearch cluster - https://phabricator.wikimedia.org/T301458 (BTullis) a:BTullis Now starting to work on this task. I think the first thing I'll have to do is ask for a service IP to be allocated.
[15:44:50] btullis: kudos https://phabricator.wikimedia.org/T303049 is very nice
[15:45:39] elukey: 😊 Thanks
[15:50:23] !log deployed airflow in an-test-client1001 to test skein log fix
[15:50:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:04:37] Data-Engineering-Kanban, Data-Catalog, Patch-For-Review: Set up opensearch cluster for datahub - https://phabricator.wikimedia.org/T301382 (BTullis) The log file for the service had lots of messages like this: ` [2022-03-04T00:00:17,947][WARN ][o.o.c.c.ClusterFormationFailureHelper] [datahubsearch10...
[16:13:24] Data-Engineering-Kanban, Data-Catalog, Patch-For-Review: Set up opensearch cluster for datahub - https://phabricator.wikimedia.org/T301382 (BTullis) Now this node has elected itself. ` [2022-03-04T16:07:37,540][INFO ][o.o.c.s.MasterService ] [datahubsearch1001-datahub] elected-as-master ([3] nodes...
[16:27:25] joal: i take it the PrometheusMetricReporter scaffolding wasn't quite done, yes?
[16:27:33] having trouble with classes and inheritance
[16:31:01] Data-Engineering-Kanban, Data-Catalog, Patch-For-Review: Set up opensearch cluster for datahub - https://phabricator.wikimedia.org/T301382 (BTullis) Re-enabled puppet and ran, reverting the changes to `/etc/opensearch/datahub/opensearch.yml` Restarted the service with `sudo systemctl restart opensear...
[16:49:03] heya ottomata
[16:49:06] haia
[16:49:13] wanna chat quickly?
[16:49:47] ya
[17:20:12] ottomata: when you guys are done, I'd like to ask you sth about the skein log level :]
[17:20:25] mforns: we done!
[17:20:35] i saw your note about fixing that client vs master param! good catch
[17:20:44] back in bc mforns
[17:21:06] actually brb
[17:22:13] back
[17:25:47] ok
[17:25:49] omw
[17:37:04] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Define LVS load-balancing for OpenSearch cluster - https://phabricator.wikimedia.org/T301458 (BTullis) I have assigned this address myself from NetBox, following these guidelines: https://wikitech.wikimedia.org/wiki/DNS/Netbox#How_to_manually_a...
[17:46:42] !log deployed Airflow to analytics instance to fix skein logs problem
[17:46:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:47:49] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Define LVS load-balancing for OpenSearch cluster - https://phabricator.wikimedia.org/T301458 (elukey) If useful, all the steps outlined in https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service
[17:56:38] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Define LVS load-balancing for OpenSearch cluster - https://phabricator.wikimedia.org/T301458 (BTullis) As per instruction from @ayounsi I have also reserved the corresponding address in codfw, in case the service ever becomes available in both...
[18:08:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org
[18:14:06] Analytics, Data-Engineering, Event-Platform, Metrics-Platform, Browser-Support-Microsoft-Edge: Problem with delay caused by intake-analytics.wikimedia.org - https://phabricator.wikimedia.org/T295427 (Torana) See also [[https://de.wikipedia.org/w/index.php?title=Wikipedia:Fragen_zur_Wikipedia&...
[18:18:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org
[18:41:22] Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Send some existing Gobblin metrics to prometheus - https://phabricator.wikimedia.org/T294420 (Ottomata) @JAllemandou how did we plan to send the task metrics? I can't find any good unique label to use other than perha...
[19:44:45] (PS4) Ottomata: [WIP] Add prometheus metrics reporter [analytics/gobblin-wmf] - https://gerrit.wikimedia.org/r/767178 (owner: Joal)
[19:45:06] (PS5) Ottomata: [WIP] Add prometheus metrics reporter [analytics/gobblin-wmf] - https://gerrit.wikimedia.org/r/767178 (owner: Joal)
[19:49:06] (CR) jerkins-bot: [V: -1] [WIP] Add prometheus metrics reporter [analytics/gobblin-wmf] - https://gerrit.wikimedia.org/r/767178 (owner: Joal)
[21:00:15] (EventgateLoggingExternalLatency) firing: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[21:05:15] (EventgateLoggingExternalLatency) resolved: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[21:19:30] Analytics, Analytics-Wikistats, Data-Engineering: Wikistats New Feature - https://phabricator.wikimedia.org/T303081 (Servetsarrac)
[21:19:37] Analytics, Analytics-Wikistats, Data-Engineering: Wikistats New Feature - https://phabricator.wikimedia.org/T303082 (Servetsarrac)
[21:22:27] Data-Engineering, Data-Engineering-Kanban: Out of disk space on stat1008 - https://phabricator.wikimedia.org/T303083 (BTullis)
[21:24:36] Data-Engineering, Data-Engineering-Kanban: Out of disk space on stat1008 - https://phabricator.wikimedia.org/T303083 (BTullis) p:Triage→High Sudden increase since 20:12 this evening. {F34975540,}
[21:25:36] I'm looking at this issue with stat1008.
[21:37:56] Fixed now.
[22:01:38] Analytics, Analytics-Kanban, Data-Engineering-Kanban, wmfdata-python, Product-Analytics (Kanban): wmfdata-python's Hive query output includes logspam - https://phabricator.wikimedia.org/T275233 (mpopov) @Milimetric: Great work! Very nice! I don't know if you've been made aware of it but wmfd...
[22:07:51] (PS1) Aklapper: Use dedicated Phabricator bug report / feature request forms [analytics/wikistats2] - https://gerrit.wikimedia.org/r/768167
[22:12:46] Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Send some existing Gobblin metrics to prometheus - https://phabricator.wikimedia.org/T294420 (Ottomata) Hm, re grouping keys. We want to push metrics really only once per job per task. For any job run, each task wi...
[23:30:43] Data-Engineering, Data-Engineering-Kanban: Out of disk space on stat1008 - https://phabricator.wikimedia.org/T303083 (BTullis) It looks like it was related to this work. https://wikimedia.slack.com/archives/CSV483812/p1646426798963189?thread_ts=1646426798.963189&cid=CSV483812 Incident resolved.
[23:31:07] Data-Engineering, Data-Engineering-Kanban: Out of disk space on stat1008 - https://phabricator.wikimedia.org/T303083 (BTullis) Open→Resolved