[02:27:06] Data-Engineering, API Platform (Sprint 03), AQS2.0, Code-Health-Objective, and 2 others: Dashboards for AQS 2.0 - https://phabricator.wikimedia.org/T288667 (VirginiaPoundstone)
[02:29:10] Data-Engineering, AQS 2.0 Roadmap, API Platform (API Platform Roadmap), Epic, Platform Engineering Roadmap: Route Device Analytics through API Gateway - https://phabricator.wikimedia.org/T326699 (VirginiaPoundstone)
[02:29:38] Data-Engineering, API Platform, AQS 2.0 Roadmap, Epic, Platform Engineering Roadmap: Route Device Analytics through API Gateway - https://phabricator.wikimedia.org/T326699 (VirginiaPoundstone)
[02:33:30] Data-Engineering, API Platform, AQS 2.0 Roadmap, Epic, Platform Engineering Roadmap: Route Device Analytics through API Gateway - https://phabricator.wikimedia.org/T326699 (VirginiaPoundstone)
[02:41:37] Data-Engineering, API Platform, AQS 2.0 Roadmap, Epic, Platform Engineering Roadmap: Route Device Analytics through API Gateway - https://phabricator.wikimedia.org/T326699 (VirginiaPoundstone)
[02:41:39] Data-Engineering, API Platform, AQS 2.0 Roadmap, Epic, and 2 others: Create k8s deployment of AQS 2.0 - https://phabricator.wikimedia.org/T288661 (VirginiaPoundstone)
[02:42:30] Data-Engineering-Planning, API Platform, GraphQL, Pageviews-API: Responses on pageview API should be lighter - https://phabricator.wikimedia.org/T145935 (VirginiaPoundstone)
[02:42:37] Data-Engineering-Planning, API Platform, GraphQL, Pageviews-API: Responses on pageview API should be lighter - https://phabricator.wikimedia.org/T145935 (VirginiaPoundstone)
[02:57:19] Data-Engineering, API Platform, AQS 2.0 Roadmap, Epic, Platform Engineering Roadmap: Route Device Analytics through API Gateway - https://phabricator.wikimedia.org/T326699 (VirginiaPoundstone) Open→Invalid
[02:57:23] Data-Engineering, API Platform, AQS 2.0 Roadmap, Epic, and 2 others: Create k8s deployment of AQS 2.0 - https://phabricator.wikimedia.org/T288661 (VirginiaPoundstone)
[02:57:25] Data-Engineering, AQS 2.0 Roadmap, API Platform (API Platform Roadmap), Epic, and 2 others: AQS 2.0: Unique Devices service - https://phabricator.wikimedia.org/T288298 (VirginiaPoundstone)
[02:59:37] Data-Engineering, API Platform (Sprint 04), AQS2.0, Code-Health-Objective, and 2 others: Dashboards for AQS 2.0 - https://phabricator.wikimedia.org/T288667 (VirginiaPoundstone)
[04:21:26] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:18:32] Data-Engineering, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1001.eqiad.wmne...
[10:41:52] Data-Engineering, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (BTullis) >>! In T324670#8513627, @fnegri wrote: > We had similar issues with `cloudcephosd*` hosts, where the device name...
[10:54:20] Data-Engineering, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1001.eqiad.wmnet wi...
[11:12:53] Data-Engineering, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1003.eqiad.wmne...
[12:10:21] Data-Engineering, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1003.eqiad.wmnet wi...
[12:11:05] Data-Engineering, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1004.eqiad.wmne...
[12:24:14] Data-Engineering, DBA, Data-Services: Prepare and check storage layer for bjnwiktionary - https://phabricator.wikimedia.org/T312214 (BTullis) I believe that this is now working. Apologies once again for the delay @Bennylin . ` btullis@tools-sgebastion-10:~$ sql bjnwiktionary Reading table information...
[13:45:08] Data-Engineering-Planning, Event-Platform Value Stream, Epic: Deployment pipeline docker image of flink mediawiki stream enrichment pyhon - https://phabricator.wikimedia.org/T326731 (Ottomata)
[14:02:31] Data-Engineering, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1004.eqiad.wmnet wi...
[14:06:40] Data-Engineering, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1005.eqiad.wmne...
[14:12:08] Data-Engineering-Planning, Patch-For-Review: Create conda .deb and docker image - https://phabricator.wikimedia.org/T304450 (Ottomata)
[14:45:42] Data-Engineering, API Platform (Sprint 03), AQS2.0, Platform Engineering Roadmap, User-Eevans: AQS 2.0: Pageviews: Implement Unit Tests - https://phabricator.wikimedia.org/T299735 (JArguello-WMF)
[14:46:57] Data-Engineering, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1005.eqiad.wmnet wi...
[15:13:59] Analytics, Data-Engineering-Planning, Event-Platform Value Stream: Enable canary events for all streams - https://phabricator.wikimedia.org/T266798 (Ottomata)
[15:16:18] Analytics-Radar, Data-Engineering-Radar, Event-Platform Value Stream: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064 (Ottomata)
[15:17:00] Data-Engineering-Planning, Event-Platform Value Stream, Spike: [SPIKE] Investigate using Knative Eventing - https://phabricator.wikimedia.org/T318862 (lbowmaker) Open→Declined
[15:19:07] Analytics, Data-Engineering-Planning, Event-Platform Value Stream, Platform Team Workboards (Clinic Duty Team): Adopt conventions for server receive and client/event timestamps in non analytics event schemas - https://phabricator.wikimedia.org/T267648 (JArguello-WMF) p:Medium→Triage
[15:21:17] Data-Engineering, Epic, Event-Platform Value Stream (Sprint 07): Deployment pipeline docker image of flink mediawiki stream enrichment pyhon - https://phabricator.wikimedia.org/T326731 (JArguello-WMF)
[15:22:23] Data-Engineering, Event-Platform Value Stream (Sprint 07): Deployment pipeline docker image of flink mediawiki stream enrichment pyhon - https://phabricator.wikimedia.org/T326731 (JArguello-WMF)
[15:24:00] Data-Engineering, Event-Platform Value Stream (Sprint 07): Deployment pipeline docker image of flink mediawiki stream enrichment pyhon - https://phabricator.wikimedia.org/T326731 (Ottomata) p:Triage→High a:gmodena
[15:32:58] Data-Engineering, API Platform (Sprint 03), AQS2.0, Platform Engineering Roadmap, User-Eevans: AQS 2.0: Pageviews: Implement Unit Tests - https://phabricator.wikimedia.org/T299735 (VirginiaPoundstone) a:Emeka-okechukwu→EChukwukere-WMF
[15:34:01] Data-Engineering, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (BTullis) In progress→Resolved >>! In T324670#8512470, @jbond wrote: > @BTullis Seems you have allready gone throu...
[15:34:03] Data-Engineering-Planning, Shared-Data-Infrastructure, Epic: Install Ceph Cluster for Data Engineering - https://phabricator.wikimedia.org/T324660 (BTullis)
[15:44:58] Data-Engineering-Planning, Shared-Data-Infrastructure, Epic: Install Ceph Cluster for Data Engineering - https://phabricator.wikimedia.org/T324660 (BTullis)
[15:47:34] Data-Engineering-Planning, Shared-Data-Infrastructure, Epic: Install Ceph Cluster for Data Engineering - https://phabricator.wikimedia.org/T324660 (BTullis)
[15:50:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp5031 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5031%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:55:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp5031 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5031%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:55:41] Data-Engineering, Shared-Data-Infrastructure, Epic: Data Infrastructure as a Service MVP - https://phabricator.wikimedia.org/T308317 (BTullis)
[15:58:26] Data-Engineering, Shared-Data-Infrastructure, Epic: Data Infrastructure as a Service MVP - https://phabricator.wikimedia.org/T308317 (BTullis)
[15:58:47] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 4.374 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[16:08:19] RECOVERY - eventgate-analytics-external validation error rate too high on alert1001 is OK: (C)2 gt (W)1 gt 0.6374 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[18:27:55] Data-Engineering-Planning, Event-Platform Value Stream: [EPIC] Flink Applications on Kubernetes - https://phabricator.wikimedia.org/T324578 (Ottomata)
[18:30:13] (VarnishkafkaNoMessages) firing: varnishkafka on cp5021 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp5021%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[18:30:13] (VarnishkafkaNoMessages) firing: varnishkafka on cp5026 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5026%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[18:35:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp5021 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp5021%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[18:35:13] (VarnishkafkaNoMessages) resolved: varnishkafka on cp5026 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5026%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[18:41:28] Data-Engineering-Planning, serviceops, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 07), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) Hm, trying to deploy flink-app-example is erroring, I think we need some ex...
[18:43:19] Data-Engineering-Planning, serviceops, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 07), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (JMeybohm) Oops, yeah. We are pretty restrictive with permissions for deployment users...
[19:06:51] Data-Engineering-Planning, serviceops, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 07), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) Or, hm, @JMeybohm @BTullis, is this because I am deploying into a namespac...
[19:09:02] Data-Engineering-Planning, serviceops, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 07), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (JMeybohm) >>! In T324576#8517486, @Ottomata wrote: > Or, hm, @JMeybohm @BTullis, is...
[19:23:39] Data-Engineering-Planning, serviceops, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 07), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) Getting somewhere! App was deployed, but: `lang=json { "@timestamp": "2...
[20:40:59] (CR) Xcollazo: "Thank you for this patch!" [analytics/refinery] - https://gerrit.wikimedia.org/r/868754 (https://phabricator.wikimedia.org/T302500) (owner: Aqu)
[20:41:54] Data-Engineering-Planning, serviceops, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 07), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) > Need more networkpolicy? ... Hm no. I think 'flink-app-main.stream-enric...