[03:25:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_prefupdate_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:00:26] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:24:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_prefupdate_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:00:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:24:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_prefupdate_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:01:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:25:24] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_prefupdate_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:16:00] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:23:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_prefupdate_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:00:05] hi team - we have an issue with events this morning
[08:00:50] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:08:17] Interesting - looks like we have timers that are in a weird state
[08:24:24] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_prefupdate_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:28:40] Ok - the problem is a refine job being stuck: application_1663082229270_510052
[08:30:04] This job was launched yesterday at about 21:30 and its last logging time was 00:30 this morning - something wrong must have happened
[08:30:54] I'm gonna kill that job so that new instances get launched
[08:38:13] !log Kill refine_eventlogging_legacy stuck job (application_1663082229270_510052)
[08:38:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:59:01] Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (JMeybohm) >>! In T324576#8454404, @Ottomata wrote: >> I will test and see what happens to a running Flink app when I take the opera...
[09:00:36] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:22:45] Data-Engineering, All-and-every-Wiktionary: Add editors per country data for Wiktionary projects - https://phabricator.wikimedia.org/T266643 (Pamputt) Is there any plan for the data engineers to work on that topic? It is enable for Wikipedia since a while and [https://stats.wikimedia.org/#/fr.wiktionary...
[12:11:27] (PS5) Milimetric: [WIP] Stream revision topics into iceberg table [analytics/refinery/source] - https://gerrit.wikimedia.org/r/858344 (https://phabricator.wikimedia.org/T322326)
[12:16:54] (CR) CI reject: [V: -1] [WIP] Stream revision topics into iceberg table [analytics/refinery/source] - https://gerrit.wikimedia.org/r/858344 (https://phabricator.wikimedia.org/T322326) (owner: Milimetric)
[13:31:46] (CR) Ottomata: [C: +2] Add schema for Extension:SearchVue actions [schemas/event/secondary] - https://gerrit.wikimedia.org/r/845498 (https://phabricator.wikimedia.org/T321069) (owner: Matthias Mullie)
[13:32:21] (Merged) jenkins-bot: Add schema for Extension:SearchVue actions [schemas/event/secondary] - https://gerrit.wikimedia.org/r/845498 (https://phabricator.wikimedia.org/T321069) (owner: Matthias Mullie)
[13:52:20] Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) > cert-manager in our cluster Ah! Do we have a cert-manager that will work with this webhook as is? Meaning we could ins...
[14:32:58] Data-Engineering-Planning, DC-Ops, SRE, ops-eqiad: hw troubleshooting: power supply for an-worker1184.eqiad.wmnet - https://phabricator.wikimedia.org/T324559 (Jclark-ctr) Open→Resolved @BTullis Reseated power supply2 fault light cleared on rear of server
[14:41:29] RECOVERY - IPMI Sensor Status on an-worker1148 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:42:47] Data-Engineering-Planning, Data Pipelines, Patch-For-Review, Technical-Debt: Productionize HDFS fsimage data analysis job - https://phabricator.wikimedia.org/T261283 (Antoine_Quhen)
[14:43:21] Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) > what the webhook actually does Responses from Flink mailing list: > webhooks in general are optional components of the...
[16:21:49] Data-Engineering-Planning, DC-Ops, Event-Platform Value Stream, SRE, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kafka-stretch1001.eqiad.w...
[16:21:54] Data-Engineering-Planning, DC-Ops, Event-Platform Value Stream, SRE, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kafka-stretch1001.eqiad.wmnet...
[16:27:51] Data-Engineering-Planning, DC-Ops, Event-Platform Value Stream, SRE, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (Cmjohnson) @papaul When I try to image these servers, the process fails immediately. This is the error I receive. Any ideas on...
[16:27:56] Data-Engineering-Planning, DC-Ops, Event-Platform Value Stream, SRE, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kafka-stretch1002.eqiad.w...
[16:44:05] Data-Engineering-Planning, DC-Ops, Event-Platform Value Stream, SRE, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (Papaul) @Cmjohnson try to delete the kafka-stretch1001.conf on install1003 and try again an let me know
[16:58:58] Data-Engineering-Planning, DC-Ops, Event-Platform Value Stream, SRE, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kafka-stretch1002.eqiad.wmnet...
[17:03:40] Analytics, AQS 2.0 Roadmap, API Platform (API Platform Roadmap), Epic, and 2 others: AQS 2.0 documentation - https://phabricator.wikimedia.org/T288664 (apaskulin)
[17:08:15] (CR) Mforns: [V: +2 C: +2] "LGTM! Thanks :]" [analytics/refinery] - https://gerrit.wikimedia.org/r/859652 (https://phabricator.wikimedia.org/T323664) (owner: MNeisler)
[18:26:45] Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) Q for @dcausse and @gmodena. I've thus far been making Flink logs go only to the console in ECS format. The console log...
[18:31:33] Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) @JMeybohm it turns out that Flinks native k8s integration [[ https://nightlies.apache.org/flink/flink-docs-master/docs/de...
[18:54:20] Data-Engineering-Planning, DC-Ops, Event-Platform Value Stream, SRE, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (Cmjohnson) KS1002 was installed without an issue, I started over with KS1001 but the mgmt IP address changed and the provision...
[19:15:34] Data-Engineering-Planning, SRE, WMF-Communications: LDAP access for Sondes to access Matomo - https://phabricator.wikimedia.org/T324696 (EdErhart-WMF)
[19:23:03] Data-Engineering-Planning, SRE, WMF-Communications: LDAP access for Sondes to access Matomo - https://phabricator.wikimedia.org/T324696 (EdErhart-WMF) Tagging @Marostegui and @jcrespo per their recent involvement with LDAP access requests
[20:05:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:16:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:21:08] Data-Engineering-Planning, SRE, WMF-Communications: LDAP access for Sondes to access Matomo - https://phabricator.wikimedia.org/T324696 (jcrespo) Hi, @EdErhart-WMF . There is no need to tag anyone- SRE has a clinic duty procedure in which someone on rotation attends LDAP requests every week. I sugges...
[22:50:34] Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) Okay @JMeybohm, I'm ready for a first pass review of the [[ https://gerrit.wikimedia.org/r/c/operations/deployment-charts...
[22:50:54] Data-Engineering, Event-Platform Value Stream: [EPIC] Flink Applications on Kubernetes - https://phabricator.wikimedia.org/T324578 (Ottomata)