[01:33:34] (SystemdUnitFailed) firing: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:15:31] Data-Engineering-Planning, Shared-Data-Infrastructure, Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (Stevemunene)
[02:15:34] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (Stevemunene)
[04:17:47] Data-Engineering: Make HTML Dumps available in hadoop - https://phabricator.wikimedia.org/T305688 (fkaelin) Thanks @MGerlach - the most recent run of the data on /mnt/data is from October 2022. Luckily I had already started the download for the enwiki as well, so I went ahead and put the March 20th 2023 html...
[05:33:34] (SystemdUnitFailed) firing: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:08:24] Data-Engineering, serviceops, Epic, Event-Platform Value Stream (Sprint 10), Patch-For-Review: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (JMeybohm)
[07:09:18] Data-Engineering, serviceops, Epic, Event-Platform Value Stream (Sprint 10), Patch-For-Review: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (JMeybohm) Could you please share resource requirements for the operator from your experiments on DSE here...
[07:53:34] (SystemdUnitFailed) firing: (8) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:54:37] (SystemdUnitFailed) firing: (8) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:04:37] (SystemdUnitFailed) firing: (9) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:08:30] Data-Engineering, SRE, SRE Observability: dropped packets to kafkamon 9000/tcp - https://phabricator.wikimedia.org/T238794 (ayounsi) > @ayounsi - are you able to confirm that dropped packets are no longer a problem for this host from the logstash firewall dashboards? I confirm.
[08:08:34] (SystemdUnitFailed) firing: (9) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:33:18] I'm going to look into this `wmf_auto_restart_envoyproxy.service` failure on an-test-ui1001 today. I don't think there is an envoyproxy installed.
[08:35:51] IIRC there may be one, we added it when the traffic team wanted TLS conns between ATS in various DCs and DE nodes/services
[08:36:39] ah no there is only the restart stuff
[08:36:51] ahhh ok the test server didn't have any ATS config, my bad
[08:37:08] scratch the nonsense I wrote, I wanted to help but I said stupid things :)
[08:38:11] elukey: It's always helpful :-)
[08:39:08] btullis: ;) - unrelated qs - is it ok if we rollout https://gerrit.wikimedia.org/r/c/operations/puppet/+/902107 later on?
[08:48:02] elukey: Yep, feel free.
[08:49:19] ack!
[09:25:38] Data-Engineering, SRE, SRE Observability: dropped packets to kafkamon 9000/tcp - https://phabricator.wikimedia.org/T238794 (fgiunchedi) >>! In T238794#8738885, @BTullis wrote: > @fgiunchedi - is this just a matter of removing some old config now? Or is there another reason why we're not seeing traff...
[09:27:14] Data-Engineering, Equity-Landscape: Milestone: Create and Publish Data Visualisation Views: - https://phabricator.wikimedia.org/T305480 (ntsako) a:ntsako
[09:27:44] Data-Engineering, Equity-Landscape: Milestone: Publish the Dashboard! - https://phabricator.wikimedia.org/T305481 (ntsako) a:ntsako
[09:27:48] !log Deploying refinery using scap
[09:27:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:28:06] Data-Engineering, Equity-Landscape: Milestone: Dashboard Template Complete - https://phabricator.wikimedia.org/T305479 (ntsako) a:okwiri_oduor→ntsako
[09:28:28] Data-Engineering, Equity-Landscape: Milestone: Data Visualization Table Views defined - https://phabricator.wikimedia.org/T305478 (ntsako) a:ntsako
[09:28:38] Data-Engineering, Equity-Landscape, Epic: Deploy the GDI Equity Landscape Dashboard - https://phabricator.wikimedia.org/T305468 (ntsako) a:ntsako
[09:38:24] !log Deploying refinery onto HDFS
[09:38:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:48:08] !log Deploy airflow analytics
[09:48:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:56:10] !log re-running refine_event
[09:56:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:02:54] Data-Engineering, Data Pipelines, WMDE-TechWish-Maintenance: Migrate or deprecate WMDE Technical Wishes reportupdater jobs - https://phabricator.wikimedia.org/T333537 (awight)
[11:48:49] Hi mforns
[11:56:33] !log Kill oozie referer_daily job - migrated to airflow
[11:56:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:08:34] (SystemdUnitFailed) firing: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:11:09] !log Kill virtualpageview oozie job - migrated to airflow
[12:11:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:13:58] Data-Engineering, Data-Services, VPS-Projects, WMDE-References-FocusArea, WMDE-TechWish-Sprint-2023-03-14: Requesting Cloud VPS access to NFS mount /public/dumps - https://phabricator.wikimedia.org/T333549 (awight)
[12:15:12] Data-Engineering, WMDE-References-FocusArea, WMDE-TechWish-Sprint-2023-03-14: Shut down our previous Cloud VPS project and create a new one - https://phabricator.wikimedia.org/T332040 (awight)
[12:15:20] Data-Engineering, Data-Services, VPS-Projects, WMDE-References-FocusArea, WMDE-TechWish-Sprint-2023-03-14: Requesting Cloud VPS access to NFS mount /public/dumps - https://phabricator.wikimedia.org/T333549 (awight)
[12:16:14] Data-Engineering, WMDE-References-FocusArea, WMDE-TechWish-Sprint-2023-03-14: Shut down our previous Cloud VPS project and create a new one - https://phabricator.wikimedia.org/T332040 (awight) Open→Resolved a:awight
[12:22:53] ping mforns in case you're around
[12:32:10] !log Deploy airflow hotfix for referer_daily
[12:32:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:38:52] joal: o/ I am seeing some tcp RST related to TLS handshakes on kafka-jumbo1001, the ips are all an-worker-related
[12:39:03] is there anything else other than gobblin that pulls from kafka?
[12:39:25] elukey: some flink jobs do
[12:40:23] ahh interesting
[12:40:30] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Investigate DB connection issues faced from airflow on an-launcher1002 - https://phabricator.wikimedia.org/T331265 (JArguello-WMF) a:nfraison→None
[12:41:01] joal: search-related flink jobs?
[12:41:17] Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade db1108 to Bullseye - https://phabricator.wikimedia.org/T304492 (JArguello-WMF)
[12:41:31] elukey: search, and possibly page-change
[12:42:01] Data-Engineering, Data-Services, VPS-Projects, WMDE-References-FocusArea, WMDE-TechWish-Sprint-2023-03-14: Requesting Cloud VPS access to NFS mount /public/dumps - https://phabricator.wikimedia.org/T333549 (awight) Open→Resolved a:awight Well, that was fast! Thanks again :-D
[12:42:20] elukey: I think page-change is currently running with k8s, but some jobs run in Yarn I think
[12:42:35] joal: mmm page-change should be on DSE using kafka-main, but search jobs may be using it
[12:47:51] Data-Engineering-Planning, Shared-Data-Infrastructure: Refactor analytics-meta MariaDB layout to use an-mariadb100[12] - https://phabricator.wikimedia.org/T284150 (JArguello-WMF) p:High→Medium
[12:48:07] the other alternative is that gobblin, for some reason, doesn't like the new cert
[12:49:00] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (JArguello-WMF) a:BTullis→Stevemunene
[12:56:08] ottomata: I'm sorry I won't make it to the meeting with TNG - I was feeling not useful in previous meetings, so I guess it's ok :)
[13:05:38] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Investigate DB connection issues faced from airflow on an-launcher1002 - https://phabricator.wikimedia.org/T331265 (BTullis) Open→Resolved a:BTullis To wrap up this investigation, here's a brief summary as I...
[13:10:55] ok so I tested the same tshark filter on an hadoop worker node, I see some tcp rsts for port 9093, but also for other non-pki brokers
[13:11:02] so it may be how the kafka client behaves
[15:13:37] Data-Engineering-Planning, Data-Catalog, Shared-Data-Infrastructure: Review and improve the build process for DataHub containers - https://phabricator.wikimedia.org/T303381 (BTullis) a:BTullis
[15:14:39] (PS10) Aqu: Migrate refine webrequest to Airflow [analytics/refinery] - https://gerrit.wikimedia.org/r/894661 (https://phabricator.wikimedia.org/T327073)
[15:15:08] Data-Engineering-Planning, Data-Catalog, Shared-Data-Infrastructure: Review and improve the build process for DataHub containers - https://phabricator.wikimedia.org/T303381 (BTullis)
[15:25:06] Analytics-Radar, Data-Engineering-Radar, Event-Platform Value Stream, Patch-For-Review: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064 (elukey) I've read https://www.golinuxcloud.com/troubleshooting-tls-failures-wireshark/ and found the following tsha...
[15:28:34] Data-Engineering-Planning, Data-Catalog, Shared-Data-Infrastructure: Review and improve the build process for DataHub containers - https://phabricator.wikimedia.org/T303381 (BTullis)
[15:39:12] Data-Engineering, Event-Platform Value Stream: mediwiki-event-enrichment in k8s should use mwapi-async envoy listener for stream config in - https://phabricator.wikimedia.org/T333575 (Ottomata)
[15:50:15] Analytics-Radar, DC-Ops, SRE, SRE-swift-storage, ops-eqiad: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (Jclark-ctr) @jbond i have batteries for all of these can this be done tomorrow? If possible can you shut down server and I can perform repair 9am est tomorrow?
[15:52:34] joal that's fine! it's optional, only if you want to come! :)
[15:53:18] Analytics-Radar, DC-Ops, SRE, SRE-swift-storage, ops-eqiad: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (jbond) @Jclark-ctr you will need to contact someone in analytics (possibly @BTullis) and data persistence (maybe @MatthewVernon)
[16:08:59] (SystemdUnitFailed) firing: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:56:44] Data-Engineering, SRE, ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (Cmjohnson) a:Jclark-ctr→Cmjohnson
[17:12:09] Data-Engineering, Advanced-Search, All-and-every-Wikisource, ArticlePlaceholder, and 66 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (EBernhardson)
[17:20:41] !log killed Oozie mediawiki-history-check_denormalize job and started Airflow mediawiki_history_check_denormalize dag.
[17:20:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:22:26] Data-Engineering, SRE, ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (BTullis) @Cmjohnson I can't think of any reason why six disks should have failed. I think they're all single volume RAID 0 logical volumes, aren't they? We've power cycled it a few times with...
[17:30:14] !log deployed airflow analytics - mediawiki_wikitext dags
[17:30:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:28:38] !log deployed hotfix for airflow mediawiki_wikitext_current and mediawiki_wikitext_history dags.
[18:28:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:31:15] !log Killed Oozie mediawiki-wikitext-history-coord and mediawiki-wikitext-current-coord
[18:31:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:32:10] !log started Airflow mediawiki wikitext dags after killing oozie jobs as part of Migration task.
[18:32:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:13:10] (SystemdUnitFailed) firing: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:23:59] (CR) Bartosz Dziewoński: [C: +2] EditAttemptStep: Add a new abort type for page updates [schemas/event/secondary] - https://gerrit.wikimedia.org/r/903845 (https://phabricator.wikimedia.org/T301582) (owner: DLynch)
[20:24:39] (Merged) jenkins-bot: EditAttemptStep: Add a new abort type for page updates [schemas/event/secondary] - https://gerrit.wikimedia.org/r/903845 (https://phabricator.wikimedia.org/T301582) (owner: DLynch)
[20:33:10] (SystemdUnitFailed) firing: (8) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:42:56] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:46:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:48:10] (SystemdUnitFailed) firing: (8) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:07:20] Data-Engineering-Planning, DC-Ops, SRE, Shared-Data-Infrastructure, ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (Jclark-ctr) @BTullis what HW raid to not in task
[23:34:10] Data-Engineering-Planning, Data-Catalog, Shared-Data-Infrastructure: Review and improve the build process for DataHub containers - https://phabricator.wikimedia.org/T303381 (BTullis)