[05:41:27] 10Data-Engineering, 10API Platform (Sprint 04), 10AQS2.0, 10Platform Engineering Roadmap, 10User-Eevans: AQS 2.0: Pageviews: Implement Unit Tests - https://phabricator.wikimedia.org/T299735 (10SGupta-WMF) a:05SGupta-WMF→03EChukwukere-WMF [08:06:46] 10Data-Engineering, 10Event-Platform Value Stream, 10FR-MW-Vagrant: EventBus should not blackhole undeclared streams - https://phabricator.wikimedia.org/T329480 (10Tgr) [08:07:14] 10Data-Engineering, 10Event-Platform Value Stream, 10MediaWiki-Vagrant: EventBus should not blackhole undeclared streams - https://phabricator.wikimedia.org/T329480 (10Tgr) [08:17:32] Thanks a lot mforns btullis milimetric for the skein certificate problem on my ops week! Without you, this morning would have been a mess. [08:44:30] (03CR) 10Gergő Tisza: [C: 03+2] image-suggestions-feedback: Bump to version 2.0.0 [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/809150 (https://phabricator.wikimedia.org/T302925) (owner: 10Kosta Harlan) [08:45:16] (03Merged) 10jenkins-bot: image-suggestions-feedback: Bump to version 2.0.0 [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/809150 (https://phabricator.wikimedia.org/T302925) (owner: 10Kosta Harlan) [08:52:20] !log disabled puppet agent on an-presto[1001-1015].eqiad.wmnet and an-coord1001.eqiad.wmnet to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/888214 on test cluster first only [08:52:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:59:54] !log restarting presto-worker on an-test-presto1001 to pick up new gc logging settings T329054 [08:59:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:59:57] T329054: Spike: Create a new presto coordinator - https://phabricator.wikimedia.org/T329054 [09:04:33] !log restarting presto-coordinator on an-test-coord1001 to pick up new gc logging settings T329054 [09:04:36] Logged the message at 
https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:04:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:08:52] !log Rerun killed Oozie pageview-hourly-coord of 2023-02-11 with sudo -u analytics kerberos-run-command analytics oozie job --oozie $OOZIE_URL -rerun 0019103-210107075406929-oozie-oozi-C -date 2023-02-11T14:00Z::2023-02-11T16:00Z [09:08:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:15:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:20:51] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:54:19] 10Data-Engineering-Planning, 10Patch-For-Review, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Create puppet profiles for the new ceph cluster - https://phabricator.wikimedia.org/T328123 (10BTullis) @JArguello-WMF - I've removed the epic tag. I think that was probably my fault. when I created i... 
[09:54:46] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10BTullis) [09:54:57] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (10BTullis) [09:55:09] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade the Data Engineering team's Zookeeper servers to Bullseye - https://phabricator.wikimedia.org/T329362 (10BTullis) [09:55:09] I think some new event schema have been deployed 1.5 hours ago. (https://github.com/wikimedia/schemas-event-secondary/commit/5d85287cd71d997cf293831700713d6642b74fcf). Now the canary check systemd is in error for this schema. [09:55:09] Shouldn't we restart eventgate in this case? [09:55:18] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Stats clients to bullseye - https://phabricator.wikimedia.org/T329360 (10BTullis) [09:56:39] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure, 10Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10BTullis) @JArguello-WMF this one is an epic and will need several child tickets. There are about 172 servers to be upgraded at the m... [09:57:01] !log re-enabled puppet agent on an-presto[1001-1015].eqiad.wmnet and an-coord1001.eqiad.wmnet [09:57:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:57:55] 10Data-Engineering-Planning: Decide on installation details for new ceph cluster - https://phabricator.wikimedia.org/T326945 (10BTullis) [09:58:09] hello folks! 
[09:58:26] Hi elukey [09:58:38] Andrew told me that there are rc streams for mediawiki.page_change to test, but I checked in the kafka topics and I found them empty :( [09:58:42] bonjour joal :) [09:59:32] tried eqiad.rc1.mediawiki.page_change and rc0 on kafka-main [09:59:44] !log restarting presto-coordinator on an-coord1001 to pick up new gc logging settings T329054 [09:59:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:59:46] T329054: Spike: Create a new presto coordinator - https://phabricator.wikimedia.org/T329054 [10:00:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:02:23] (03CR) 10DCausse: "lgtm, left some nits regarding code styles and a couple questions regarding Webrequest (but feel free to ignore them)" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: 10Aqu) [10:05:45] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10BTullis) a:03BTullis [10:06:04] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:22] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:31] (03PS1) 10Aqu: Fix typo in image suggestion schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/888642 (https://phabricator.wikimedia.org/T302925) [10:06:42] PROBLEM - Presto Server on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [10:06:49] (03CR) 10DCausse: Remove Guava from dependency (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: 10Aqu) [10:12:04] tgr: hello, I'm proposing a fix to the code deployed 2hours ago: https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/888642 . It seems that the schema is not valid (missing array item type). What do you think? [10:12:32] ah wait I see the streams on jumbo [10:15:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:15:22] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:15:23] !log Reimage an-test-worker1001 to upgrade to bullseye T329363 [10:15:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:15:25] T329363: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 [10:16:53] (03CR) 10DCausse: Remove Guava from dependency (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: 10Aqu) [10:19:14] RECOVERY - Presto Server on an-coord1001 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [10:20:20] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:34:55] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Hadoop test cluster to 
Bullseye - https://phabricator.wikimedia.org/T329363 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-test-worker1001.eqiad.wmnet with... [10:45:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:45:37] 10Data-Engineering, 10CheckUser, 10MW-1.38-notes (1.38.0-wmf.26; 2022-03-14), 10MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), and 4 others: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 (10Dreamy_Jazz) [10:45:46] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Presto is unstable with more than 5 worker nodes - https://phabricator.wikimedia.org/T325809 (10BTullis) [10:46:13] !log restarting presto-worker on an-presto[1001-1015].eqiad.wmnet to pick up new gc logging settings T329054 [10:46:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:46:16] T329054: Spike: Create a new presto coordinator - https://phabricator.wikimedia.org/T329054 [10:50:50] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:20] 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Discovery-Search, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10Marostegui) [11:08:14] 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Discovery-Search, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10Marostegui) [11:15:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:19:54] PROBLEM - Check systemd state on an-presto1007 is CRITICAL: 
CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:20:22] PROBLEM - Presto Server on an-presto1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [11:20:30] PROBLEM - Check systemd state on an-presto1008 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:20:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:20:54] PROBLEM - Presto Server on an-presto1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [11:22:40] PROBLEM - Presto Server on an-presto1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [11:22:58] PROBLEM - Check systemd state on an-presto1006 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:24:52] 10Data-Engineering, 10Patch-For-Review: Add a presto query logger - https://phabricator.wikimedia.org/T269832 (10BTullis) [11:29:00] elukey: yeah, jumbo, here's a job that enriches that stream with content: https://gitlab.wikimedia.org/-/snippets/42 (nice snippet id gabriele :)) [11:34:30] 10Data-Engineering, 10Patch-For-Review: Add a presto query logger - https://phabricator.wikimedia.org/T269832 (10BTullis) I'm adding {T325809} as 
a parent ticket of this, since we may wish to return to this work on logging presto queries as part of the investigation into the cause of that problem. It may... [11:35:34] btullis / nfraison: something's up with the an-presto(s), they're all alerting ^ [11:39:34] 10Data-Engineering, 10Patch-For-Review: Add a presto query logger - https://phabricator.wikimedia.org/T269832 (10BTullis) @Spaceliberty - Thanks for reaching out and apologies for the long delay in replying. Unfortunately, I don't think that we can offer much help to your question, as we do not currently run T... [11:43:09] aqu: indeed it seems that https://github.com/wikimedia/eventgate/blob/master/lib/EventValidator.js#L224 caches schemas indefinitely. I vote that we make a configurable cache duration so we don't have to restart EventGate, but indeed, in this case it needed a restart. I can +2 your schema fix as I think it's safe and I'd like to quiet the alarms. [11:43:26] (03CR) 10Milimetric: [C: 03+2] Fix typo in image suggestion schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/888642 (https://phabricator.wikimedia.org/T302925) (owner: 10Aqu) [11:44:27] PROBLEM - Presto Server on an-presto1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [11:45:19] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10BTullis) The first worker reimage failed with a puppet error. https://puppetboard.wikimedia.org/report/an-test-worker1001.eqiad.wmnet/1... 
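The [11:43:09] message above suggests a configurable cache duration so that a corrected schema is picked up without a restart. A minimal sketch of that idea follows, assuming a time-to-live cache wrapped around some schema loader. It is not EventGate's code (EventGate is written in JavaScript), and `TTLSchemaCache`, `loader`, `ttl_seconds` and `clock` are hypothetical names used only for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class TTLSchemaCache:
    """Cache loaded schemas and reload an entry once it is older than ttl_seconds.

    Hypothetical illustration of a "configurable cache duration";
    not taken from EventGate.
    """

    def __init__(
        self,
        loader: Callable[[str], dict],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader      # fetches a schema by URI (assumed to be provided)
        self._ttl = ttl_seconds    # how long a cached schema stays valid
        self._clock = clock        # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, schema_uri: str) -> dict:
        now = self._clock()
        entry = self._entries.get(schema_uri)
        # Serve from the cache only while the entry is younger than the TTL.
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        # Missing or expired: reload and remember when it was loaded.
        schema = self._loader(schema_uri)
        self._entries[schema_uri] = (now, schema)
        return schema
```

With a cache like this, a fixed schema would be served once the old entry expires, at the cost of re-fetching each schema once per TTL window.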
[11:45:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:37] PROBLEM - Check systemd state on an-presto1014 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:49:55] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:53:06] 10Data-Engineering, 10Data-Catalog, 10Infrastructure-Foundations, 10CAS-SSO, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (10BTullis) [11:56:33] 10Data-Engineering, 10Data-Catalog, 10Infrastructure-Foundations, 10CAS-SSO, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (10BTullis) p:05Triage→03Medium We have decided that we would like to try to imple... [12:13:53] milimetric: kind of false alert, it is due to restart of presto on an-presto1006 to an-presto1015, which puppet then stops as those nodes are currently deactivated. 
Probably something that will have to be improved in the restart cookbook [12:14:48] I will see to downtime the alerts [12:26:20] ACKNOWLEDGEMENT - Check systemd state on an-presto1006 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service Nicolas Fraison Presto is currently disabled on those nodes while working on T325809 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:26:20] ACKNOWLEDGEMENT - Presto Server on an-presto1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer Nicolas Fraison Presto is currently disabled on those nodes while working on T325809 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [12:26:20] ACKNOWLEDGEMENT - Check systemd state on an-presto1007 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service Nicolas Fraison Presto is currently disabled on those nodes while working on T325809 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:26:20] ACKNOWLEDGEMENT - Presto Server on an-presto1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer Nicolas Fraison Presto is currently disabled on those nodes while working on T325809 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [12:30:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:35:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:01] aha, thx [12:40:08] Thanks milimetric for the merge. Do you know how to restart the eventgate service now? Shall we do it? 
[12:40:53] I think we should, but doubt we have rights... actually I'm not sure how to do it, I didn't find docs [12:44:14] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (10nfraison) a:03nfraison [12:45:59] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10BTullis) OK, I understand the context for this failure more now. Here's the most succinct comment on why the `bigtop::mysql_jdbc` class... [12:52:51] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (10nfraison) 05Open→03In progress [12:52:53] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure, 10Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10nfraison) [12:55:33] FYI, I will reimage the only test presto worker an-test-presto1001. Is it fine if I start the operation now or do you need it today for some testing? [13:09:16] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Cleanup User Hive Databases - https://phabricator.wikimedia.org/T323884 (10EChetty) [13:09:20] 10Data-Engineering-Planning, 10Data Pipelines, 10Shared-Data-Infrastructure: [Iceberg] Debianize and install iceberg support for Spark, Presto, and optionally Hive - https://phabricator.wikimedia.org/T311738 (10EChetty) [13:18:51] milimetric: Did you find someone to restart eventgate? If not, which of the eventgate services do you need restarted? Can you link the CR? 
[13:18:56] https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate/Administration#Roll_restart_all_pods [13:19:20] Yes, this is the correct way [13:19:22] I think nfraison or btullis can do it, but I'm still not 100% sure that's what we need [13:19:49] it doesn't mention anywhere that we would need to restart everything just for a schema change... and it seems extreme, but the code seems to indicate that's the case [13:19:57] like, I don't see any code that evicts the schema cache [13:20:04] And yes, n.fraison and b.tullis have access to deploy1002 and can do it [13:24:30] claime: I'll happily do it. [13:24:31] I don't know much about the service itself, so can't help you with knowing if it needs a restart or not. [13:25:36] btullis: Cool, leaving it up to you then :) [13:26:53] claime: ack. I'm just looking to see who actually does have the required rights. I don't think that it's SRE only, but I can come back to that anyway. [13:27:24] btullis: I think it's anyone with deployment rights, you don't need sudo for helmfile [13:29:12] Yeah, I think it's members of the deployment group [13:29:21] Right. Currently it looks like aqu doesn't have deployment rights, but milimetric and joal (for instance) do. Checked with `getent group deployment|grep aqu` on deploy1002. [13:30:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:01] 10Data-Engineering-Planning, 10Equity-Landscape: Load language data - https://phabricator.wikimedia.org/T315886 (10ntsako) Table loaded with the below query so as to use our country meta data table instead of canonical countries table. ` WITH all_countries AS ( SELECT DISTINCT * FROM (... 
[13:33:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:40:12] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Presto is unstable with more than 5 worker nodes - https://phabricator.wikimedia.org/T325809 (10EChetty) [13:40:14] 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, 10Epic: Alluxio for Improved Superset Query Performance - https://phabricator.wikimedia.org/T288252 (10EChetty) [13:40:16] 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Epic: Analytics Presto improvements - https://phabricator.wikimedia.org/T266639 (10EChetty) [13:40:53] 10Data-Engineering, 10Patch-For-Review, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Add a presto query logger - https://phabricator.wikimedia.org/T269832 (10EChetty) 05In progress→03Open [13:40:56] 10Data-Engineering, 10Patch-For-Review, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Add a presto query logger - https://phabricator.wikimedia.org/T269832 (10EChetty) a:05BTullis→03nfraison [13:42:41] btullis: thanks, I vaguely remember getting those rights but I think I used it once like three years ago or something like that. Did you restart EventGate or should I? [13:44:41] milimetric: I will do it. Got called into a meeting, but will do it afterwards. [13:44:50] thx! [13:45:18] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:46:57] 10Data-Engineering-Planning, 10Equity-Landscape: Load language data - https://phabricator.wikimedia.org/T315886 (10KCVelaga_WMF) @ntsako That's a good idea. Thanks for the improvements. 
[13:47:03] 10Data-Engineering, 10Event-Platform Value Stream: jsonschema-tools tests - ensure that array items type is set - https://phabricator.wikimedia.org/T329515 (10Ottomata) [13:49:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:01:18] milimetric: are you sure aqu doesn't have deploy rights? [14:01:24] and if not, perhaps he should! [14:01:49] you probably do. [14:02:03] try this [14:02:15] ssh to deploy1002.eqiad.wmnet [14:02:25] cd /srv/deployment-charts/helmfile.d/services/eventgate-main [14:02:28] helmfile status [14:02:55] if you can do that you can probably do the roll restart [14:03:01] oh [14:03:03] helmfile -e eqiad status [14:03:28] ottomata: I agree that aqu should probably be in the deployers group, but for now I'm happy to restart eventgate. [14:04:26] yeah, I think the whole team should have deploy rights, maybe it should be part of analytics-admins? But yes, I saw the docs now, I linked it above [14:04:46] What ottomata: What order would you do? main, analytics, analytics-external, logging-external ? [14:05:11] Excuse the stray (What) above please :-) [14:06:34] !log Reimage an-test-presto1001 to upgrade to bullseye T329361 [14:06:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:06:36] T329361: Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 [14:10:48] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by nfraison@cumin1001 for host an-test-presto1001.eqiad.wmnet with OS... 
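The schema problem discussed earlier was a missing array item type, and T329515 above asks for a test that ensures the array items type is set. The check could look roughly like the sketch below. This is not the jsonschema-tools implementation: `arrays_missing_item_type` is a hypothetical helper, and it assumes a schema already parsed into a dict that uses only `type`, `properties` and `items`.

```python
from typing import Iterator


def arrays_missing_item_type(schema: dict, path: str = "$") -> Iterator[str]:
    """Yield the path of every array-typed node that lacks items.type.

    Hypothetical helper; only `properties` and `items` are traversed.
    """
    if schema.get("type") == "array":
        items = schema.get("items")
        if not isinstance(items, dict) or "type" not in items:
            yield path
    # Recurse into object properties.
    for name, sub in schema.get("properties", {}).items():
        if isinstance(sub, dict):
            yield from arrays_missing_item_type(sub, f"{path}.properties.{name}")
    # Recurse into array items, so arrays of arrays are covered as well.
    items = schema.get("items")
    if isinstance(items, dict):
        yield from arrays_missing_item_type(items, f"{path}.items")


# Made-up example schema: "tags" and "reasons" have no item type, "ids" does.
EXAMPLE_SCHEMA = {
    "type": "object",
    "properties": {
        "tags": {"type": "array"},
        "ids": {"type": "array", "items": {"type": "integer"}},
        "nested": {
            "type": "object",
            "properties": {"reasons": {"type": "array", "items": {}}},
        },
    },
}

if __name__ == "__main__":
    for bad_path in arrays_missing_item_type(EXAMPLE_SCHEMA):
        print("array without items.type:", bad_path)
```

A check like this could run in the schema repository's tests, so the problem is caught before a schema is merged and deployed.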
[14:11:12] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by nfraison@cumin1001 for host an-test-presto1001.eqiad.wmnet with OS bull... [14:13:52] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 08): Deploy mediawiki-event-enrichment flink app to DSE k8s - https://phabricator.wikimedia.org/T325305 (10Ottomata) [14:14:19] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (10nfraison) Reimage failed ` For info, please visit https://www.isc.org/software/dhcp/ /etc/dhcp/automation/ttyS0-115200/an-test-presto1001.c... [14:14:30] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 09): Flink Enrichment monitoring - https://phabricator.wikimedia.org/T328925 (10Ottomata) [14:15:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:29] !log roll-restarting all eventgate pods [14:15:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:19:37] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (10nfraison) rm manually the ttyS0-115200/an-test-presto1001.con file as indicated in doc https://wikitech.wikimedia.org/wiki/Server_Lifecycle/... 
[14:20:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:02] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by nfraison@cumin1001 for host an-test-presto1001.eqiad.wmnet with OS... [14:22:10] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by nfraison@cumin1001 for host an-test-presto1001.eqiad.wmnet with OS bull... [14:29:45] btullis:only main is needed [14:32:08] btullis: curl 'https://meta.wikimedia.org/w/api.php?action=streamconfigs&all_settings=1&streams=mediawiki.image_suggestions_feedback' | jq .streams [14:32:13] destination_event_service": "eventgate-main [14:38:23] 10Data-Engineering, 10Event-Platform Value Stream: Refactor Image Suggestions Feedback > Cassandra Flink Job and Deploy to DSE k8s - https://phabricator.wikimedia.org/T329524 (10lbowmaker) [14:38:41] 10Data-Engineering-Planning, 10Patch-For-Review, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by nfraison@cumin1001 for host an-test-presto100... [14:38:49] 10Data-Engineering-Planning, 10Patch-For-Review, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by nfraison@cumin1001 for host an-test-presto1001.eq... 
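The curl command at [14:32:08] pipes the stream config through `jq .streams` to find which EventGate service a stream is routed to. The same lookup could be sketched in Python as below. The response shape is an assumption (a `streams` object keyed by stream name), `SAMPLE_RESPONSE` is a trimmed, made-up body containing only the field quoted in the log, and `destination_event_service` is a hypothetical helper name.

```python
import json


def destination_event_service(response_text: str, stream: str) -> str:
    """Return the destination_event_service configured for a stream.

    Assumes the body has the shape {"streams": {<stream name>: {...}}}.
    """
    streams = json.loads(response_text)["streams"]
    return streams[stream]["destination_event_service"]


# Made-up, trimmed response body; a real response would carry more settings.
SAMPLE_RESPONSE = json.dumps(
    {
        "streams": {
            "mediawiki.image_suggestions_feedback": {
                "destination_event_service": "eventgate-main"
            }
        }
    }
)

if __name__ == "__main__":
    print(
        destination_event_service(
            SAMPLE_RESPONSE, "mediawiki.image_suggestions_feedback"
        )
    )
```

Knowing the destination service narrows the restart to one EventGate deployment, which matches the "only main is needed" answer at [14:29:45].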
[14:42:45] 10Data-Engineering-Planning, 10Patch-For-Review, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by nfraison@cumin1001 for host an-test-presto100... [14:42:50] 10Data-Engineering-Planning, 10Patch-For-Review, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by nfraison@cumin1001 for host an-test-presto1001.eq... [14:44:37] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10BTullis) Having discussed it with @eluky, I'm going to proceed with option 1) above. Namely, trying version 2.7.2 of libmariadb-java T... [14:46:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:46:08] ottomata: Ah, thanks. That is handy. [14:46:48] 10Data-Engineering-Planning, 10Patch-For-Review, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by nfraison@cumin1001 for host an-test-presto100... 
[14:53:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:55:17] 10Data-Engineering, 10Event-Platform Value Stream: Refactor Image Suggestions Feedback > Cassandra Flink Job and Deploy to DSE k8s - https://phabricator.wikimedia.org/T329524 (10tchin) [[ https://gitlab.wikimedia.org/repos/generated-data-platform/image-suggestions-feedback/-/tree/main/ | Looking back at the co... [15:00:01] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:19:02] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Presto is unstable with more than 5 worker nodes - https://phabricator.wikimedia.org/T325809 (10nfraison) [15:27:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:38:01] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (10nfraison) puppet failed on presto test ` Feb 13 15:22:48 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Rsyslog/Service[rsyslog]) Ski... 
[15:57:21] Data-Engineering-Planning, Data Pipelines, Release-Engineering-Team, serviceops-collab, GitLab (CI & Job Runners): Experiencing pipeline failure due to disk-space issues - https://phabricator.wikimedia.org/T310593 (dancy) Open→Resolved a:dancy @JAllemandou This should be resolved... [15:59:01] (PS2) Snwachukwu: Update Webrequest table to include referer_data column. [analytics/refinery] - https://gerrit.wikimedia.org/r/887371 (https://phabricator.wikimedia.org/T327074) [15:59:44] (CR) Snwachukwu: Update Webrequest table to include referer_data column. (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/887371 (https://phabricator.wikimedia.org/T327074) (owner: Snwachukwu) [15:59:46] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by nfraison@cumin1001 for host an-test-presto1001.eqiad.wmnet with OS bull... [16:01:22] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:02:53] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (nfraison) And then this one ` Error: Cannot create /srv/presto/var/log; parent directory /srv/presto/var does not exist Error: /Stage[main]... 
[16:10:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:13:32] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [16:15:46] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:17:19] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (Gehel) [16:22:20] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (Gehel) [16:26:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:15:49] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:21:35] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:30:13] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:35:09] 
PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:45:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:54:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:01:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:08:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:15:51] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:21:13] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:30:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:35:27] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:46:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:51:41] 
Starting build #18 for job wikimedia-event-utilities-maven-release-docker [20:54:42] Project wikimedia-event-utilities-maven-release-docker build #18: SUCCESS in 3 min 2 sec: https://integration.wikimedia.org/ci/job/wikimedia-event-utilities-maven-release-docker/18/ [20:56:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:00:29] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:07:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:30:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:33:34] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:39:33] !log enabled rc1.mediawiki.page_change stream on group0 and group1 wikis [21:39:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [21:39:51] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 08): Deploy mediawiki-event-enrichment flink app to DSE k8s - https://phabricator.wikimedia.org/T325305 (Ottomata) [21:40:03] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 08): Deploy mediawiki-event-enrichment flink app to DSE k8s - https://phabricator.wikimedia.org/T325305 (Ottomata) [21:40:30] !log deploying section_topics v0.5.0 on platform_eng Airflow instance [21:40:31] Logged the message at 
https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [21:45:32] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:53:06] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:01:42] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:08:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:16:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:23:20] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:30:18] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:34:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [22:35:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:44:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [22:45:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:54:42] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:16:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:23:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:30:44] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:36:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:45:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:52:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state