[01:33:34] <jinxer-wm>	 (SystemdUnitFailed) firing: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:15:31] <wikibugs>	 Data-Engineering-Planning, Shared-Data-Infrastructure, Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (Stevemunene)
[02:15:34] <wikibugs>	 Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (Stevemunene)
[04:17:47] <wikibugs>	 Data-Engineering: Make HTML Dumps available in hadoop - https://phabricator.wikimedia.org/T305688 (fkaelin) Thanks @MGerlach - the most recent run of the data on /mnt/data is from October 2022. Luckily I had already started the download for the enwiki as well, so I went ahead and put the March 20th 2023 html...
[05:33:34] <jinxer-wm>	 (SystemdUnitFailed) firing: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:08:24] <wikibugs>	 Data-Engineering, serviceops, Epic, Event-Platform Value Stream (Sprint 10), Patch-For-Review: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (JMeybohm)
[07:09:18] <wikibugs>	 Data-Engineering, serviceops, Epic, Event-Platform Value Stream (Sprint 10), Patch-For-Review: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (JMeybohm) Could you please share resource requirements for the operator from your experiments on DSE here...
[07:53:34] <jinxer-wm>	 (SystemdUnitFailed) firing: (8) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:54:37] <jinxer-wm>	 (SystemdUnitFailed) firing: (8) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:04:37] <jinxer-wm>	 (SystemdUnitFailed) firing: (9) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:08:30] <wikibugs>	 Data-Engineering, SRE, SRE Observability: dropped packets to kafkamon 9000/tcp - https://phabricator.wikimedia.org/T238794 (ayounsi) > @ayounsi - are you able to confirm that dropped packets are no longer a problem for this host from the logstash firewall dashboards? I confirm.
[08:08:34] <jinxer-wm>	 (SystemdUnitFailed) firing: (9) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:33:18] <btullis>	 I'm going to look into this `wmf_auto_restart_envoyproxy.service` failure on an-test-ui1001 today. I don't think there is an envoyproxy installed.
[08:35:51] <elukey>	 IIRC there may be one, we added it when the traffic team wanted TLS conns between ATS in various DCs and DE nodes/services
[08:36:39] <elukey>	 ah no there is only the restart stuff
[08:36:51] <elukey>	 ahhh ok the test server didn't have any ATS config, my bad
[08:37:08] <elukey>	 scratch the nonsense I wrote, I wanted to help but I said stupid things :)
[08:38:11] <btullis>	 elukey: It's always helpful :-) 
[08:39:08] <elukey>	 btullis: ;) - unrelated qs - is it ok if we rollout https://gerrit.wikimedia.org/r/c/operations/puppet/+/902107 later on?
[08:48:02] <btullis>	 elukey: Yep, feel free.
[08:49:19] <elukey>	 ack!
[09:25:38] <wikibugs>	 Data-Engineering, SRE, SRE Observability: dropped packets to kafkamon 9000/tcp - https://phabricator.wikimedia.org/T238794 (fgiunchedi) >>! In T238794#8738885, @BTullis wrote:  > @fgiunchedi - is this just a matter of removing some old config now? Or is there another reason why we're not seeing traff...
[09:27:14] <wikibugs>	 Data-Engineering, Equity-Landscape: Milestone: Create and Publish Data Visualisation Views: - https://phabricator.wikimedia.org/T305480 (ntsako) a:ntsako
[09:27:44] <wikibugs>	 Data-Engineering, Equity-Landscape: Milestone: Publish the Dashboard! - https://phabricator.wikimedia.org/T305481 (ntsako) a:ntsako
[09:27:48] <joal>	 !log Deploying refinery using scap
[09:27:49] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:28:06] <wikibugs>	 Data-Engineering, Equity-Landscape: Milestone: Dashboard Template Complete - https://phabricator.wikimedia.org/T305479 (ntsako) a:okwiri_oduor→ntsako
[09:28:28] <wikibugs>	 Data-Engineering, Equity-Landscape: Milestone: Data Visualization Table Views defined - https://phabricator.wikimedia.org/T305478 (ntsako) a:ntsako
[09:28:38] <wikibugs>	 Data-Engineering, Equity-Landscape, Epic: Deploy the GDI Equity Landscape Dashboard - https://phabricator.wikimedia.org/T305468 (ntsako) a:ntsako
[09:38:24] <joal>	 !log Deploying refinery onto HDFS
[09:38:25] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:48:08] <joal>	 !log Deploy airflow analytics
[09:48:12] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:56:10] <btullis>	 !log re-running refine_event
[09:56:11] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:02:54] <wikibugs>	 Data-Engineering, Data Pipelines, WMDE-TechWish-Maintenance: Migrate or deprecate WMDE Technical Wishes reportupdater jobs - https://phabricator.wikimedia.org/T333537 (awight)
[11:48:49] <joal>	 Hi mforns 
[11:56:33] <joal>	 !log Kill oozie referer_daily job - migrated to airflow
[11:56:34] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:08:34] <jinxer-wm>	 (SystemdUnitFailed) firing: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:11:09] <joal>	 !log Kill virtualpageview oozie job - migrated to airflow
[12:11:10] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:13:58] <wikibugs>	 Data-Engineering, Data-Services, VPS-Projects, WMDE-References-FocusArea, WMDE-TechWish-Sprint-2023-03-14: Requesting Cloud VPS access to NFS mount /public/dumps - https://phabricator.wikimedia.org/T333549 (awight)
[12:15:12] <wikibugs>	 Data-Engineering, WMDE-References-FocusArea, WMDE-TechWish-Sprint-2023-03-14: Shut down our previous Cloud VPS project and create a new one - https://phabricator.wikimedia.org/T332040 (awight)
[12:15:20] <wikibugs>	 Data-Engineering, Data-Services, VPS-Projects, WMDE-References-FocusArea, WMDE-TechWish-Sprint-2023-03-14: Requesting Cloud VPS access to NFS mount /public/dumps - https://phabricator.wikimedia.org/T333549 (awight)
[12:16:14] <wikibugs>	 Data-Engineering, WMDE-References-FocusArea, WMDE-TechWish-Sprint-2023-03-14: Shut down our previous Cloud VPS project and create a new one - https://phabricator.wikimedia.org/T332040 (awight) Open→Resolved a:awight
[12:22:53] <joal>	 ping mforns in case you're around
[12:32:10] <joal>	 !log Deploy airflow hotfix for referer_daily
[12:32:14] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:38:52] <elukey>	 joal: o/ I am seeing some tcp RST related to TLS handshakes on kafka-jumbo1001, the ips are all an-worker-related
[12:39:03] <elukey>	 is there anything else other than gobblin that pulls from kafka?
[12:39:25] <joal>	 elukey: some flink jobs do
[12:40:23] <elukey>	 ahh interesting
[12:40:30] <wikibugs>	 Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Investigate DB connection issues faced from airflow on an-launcher1002 - https://phabricator.wikimedia.org/T331265 (JArguello-WMF) a:nfraison→None
[12:41:01] <elukey>	 joal: search-related flink jobs?
[12:41:17] <wikibugs>	 Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade db1108 to Bullseye - https://phabricator.wikimedia.org/T304492 (JArguello-WMF)
[12:41:31] <joal>	 elukey: search, and possibly page-change
[12:42:01] <wikibugs>	 Data-Engineering, Data-Services, VPS-Projects, WMDE-References-FocusArea, WMDE-TechWish-Sprint-2023-03-14: Requesting Cloud VPS access to NFS mount /public/dumps - https://phabricator.wikimedia.org/T333549 (awight) Open→Resolved a:awight Well, that was fast!  Thanks again :-D
[12:42:20] <joal>	 elukey: I think page-change is currently running with k8s, but some jobs run in Yarn I think
[12:42:35] <elukey>	 joal: mmm page-change should be on DSE using kafka-main, but search jobs may be using it
[12:47:51] <wikibugs>	 Data-Engineering-Planning, Shared-Data-Infrastructure: Refactor analytics-meta MariaDB layout to use an-mariadb100[12] - https://phabricator.wikimedia.org/T284150 (JArguello-WMF) p:High→Medium
[12:48:07] <elukey>	 the other alternative is that gobblin, for some reason, doesn't like the new cert
[12:49:00] <wikibugs>	 Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (JArguello-WMF) a:BTullis→Stevemunene
[12:56:08] <joal>	 ottomata: I'm sorry I won't make it to the meeting with TNG - I was feeling not useful in previous meetings, so I guess it's ok :)
[13:05:38] <wikibugs>	 Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Investigate DB connection issues faced from airflow on an-launcher1002 - https://phabricator.wikimedia.org/T331265 (BTullis) Open→Resolved a:BTullis To wrap up this investigation, here's a brief summary as I...
[13:10:55] <elukey>	 ok so I tested the same tshark filter on a hadoop worker node, I see some tcp rsts for port 9093, but also for other non-pki brokers
[13:11:02] <elukey>	 so it may be how the kafka client behaves
[15:13:37] <wikibugs>	 Data-Engineering-Planning, Data-Catalog, Shared-Data-Infrastructure: Review and improve the build process for DataHub containers - https://phabricator.wikimedia.org/T303381 (BTullis) a:BTullis
[15:14:39] <wikibugs>	 (PS10) Aqu: Migrate refine webrequest to Airflow [analytics/refinery] - https://gerrit.wikimedia.org/r/894661 (https://phabricator.wikimedia.org/T327073)
[15:15:08] <wikibugs>	 Data-Engineering-Planning, Data-Catalog, Shared-Data-Infrastructure: Review and improve the build process for DataHub containers - https://phabricator.wikimedia.org/T303381 (BTullis)
[15:25:06] <wikibugs>	 Analytics-Radar, Data-Engineering-Radar, Event-Platform Value Stream, Patch-For-Review: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064 (elukey) I've read https://www.golinuxcloud.com/troubleshooting-tls-failures-wireshark/ and found the following tsha...
[15:28:34] <wikibugs>	 Data-Engineering-Planning, Data-Catalog, Shared-Data-Infrastructure: Review and improve the build process for DataHub containers - https://phabricator.wikimedia.org/T303381 (BTullis)
[15:39:12] <wikibugs>	 Data-Engineering, Event-Platform Value Stream: mediwiki-event-enrichment in k8s should use mwapi-async envoy listener for stream config in - https://phabricator.wikimedia.org/T333575 (Ottomata)
[15:50:15] <wikibugs>	 Analytics-Radar, DC-Ops, SRE, SRE-swift-storage, ops-eqiad: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (Jclark-ctr) @jbond i have batteries for all of these can this be done tomorrow?  If possible can you shut down server and I can perform repair 9am est tomorrow?
[15:52:34] <ottomata>	 joal that's fine!  it's optional, only if you want to come! :)
[15:53:18] <wikibugs>	 Analytics-Radar, DC-Ops, SRE, SRE-swift-storage, ops-eqiad: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (jbond) @Jclark-ctr you will need to contact someone in analytics (possibly @BTullis) and data persistence (maybe @MatthewVernon)
[16:08:59] <jinxer-wm>	 (SystemdUnitFailed) firing: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:56:44] <wikibugs>	 Data-Engineering, SRE, ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (Cmjohnson) a:Jclark-ctr→Cmjohnson
[17:12:09] <wikibugs>	 Data-Engineering, Advanced-Search, All-and-every-Wikisource, ArticlePlaceholder, and 66 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (EBernhardson)
[17:20:41] <SandraEbele>	 !log killed Oozie mediawiki-history-check_denormalize job and started Airflow mediawiki_history_check_denormalize dag.
[17:20:42] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:22:26] <wikibugs>	 Data-Engineering, SRE, ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (BTullis) @Cmjohnson I can't think of any reason why six disks should have failed. I think they're all single volume RAID 0 logical volumes, aren't they? We've power cycled it a few times with...
[17:30:14] <SandraEbele>	 !log deployed airflow analytics - mediawiki_wikitext dags
[17:30:15] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:28:38] <SandraEbele>	 !log deployed hotfix for airflow mediawiki_wikitext_current and mediawiki_wikitext_history dags.
[18:28:40] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:31:15] <SandraEbele>	 !log Killed Oozie mediawiki-wikitext-history-coord and mediawiki-wikitext-current-coord
[18:31:28] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:32:10] <SandraEbele>	 !log started Airflow mediawiki wikitext dags after killing oozie jobs as part of Migration task.
[18:32:12] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:13:10] <jinxer-wm>	 (SystemdUnitFailed) firing: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:23:59] <wikibugs>	 (CR) Bartosz Dziewoński: [C: +2] EditAttemptStep: Add a new abort type for page updates [schemas/event/secondary] - https://gerrit.wikimedia.org/r/903845 (https://phabricator.wikimedia.org/T301582) (owner: DLynch)
[20:24:39] <wikibugs>	 (Merged) jenkins-bot: EditAttemptStep: Add a new abort type for page updates [schemas/event/secondary] - https://gerrit.wikimedia.org/r/903845 (https://phabricator.wikimedia.org/T301582) (owner: DLynch)
[20:33:10] <jinxer-wm>	 (SystemdUnitFailed) firing: (8) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:42:56] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:46:38] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:48:10] <jinxer-wm>	 (SystemdUnitFailed) firing: (8) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:07:20] <wikibugs>	 Data-Engineering-Planning, DC-Ops, SRE, Shared-Data-Infrastructure, ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (Jclark-ctr) @BTullis what HW raid to  not in task
[23:34:10] <wikibugs>	 Data-Engineering-Planning, Data-Catalog, Shared-Data-Infrastructure: Review and improve the build process for DataHub containers - https://phabricator.wikimedia.org/T303381 (BTullis)