[00:00:16] 10Data-Engineering-Planning, 10Data-Catalog, 10Shared-Data-Infrastructure: Review and improve the build process for DataHub containers - https://phabricator.wikimedia.org/T303381 (10BTullis) [00:15:34] 10Data-Engineering-Planning, 10DC-Ops, 10SRE, 10Shared-Data-Infrastructure, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10BTullis) Hi @jclark-ctr Apologies for any omission on my part. For these servers we use RAID1 for the OS, based on the two ris... [00:20:01] 10Data-Engineering-Planning, 10DC-Ops, 10SRE, 10Shared-Data-Infrastructure, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jclark-ctr) [00:48:10] (SystemdUnitFailed) firing: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:30:48] (03CR) 10Krinkle: [C: 03+1] navtiming: Add new metrics to allowlist for the navtiming schema. 
[analytics/refinery] - 10https://gerrit.wikimedia.org/r/902571 (owner: 10Phedenskog) [04:48:10] (SystemdUnitFailed) firing: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:49:45] (03PS1) 10Krinkle: sanitization: Remove some NavigationTiming retentions [analytics/refinery] - 10https://gerrit.wikimedia.org/r/904660 [07:54:02] (03CR) 10Gergő Tisza: [C: 03+1] Add analytics/mediawiki/mentor_dashboard/personalized_praise (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/891368 (https://phabricator.wikimedia.org/T325117) (owner: 10Urbanecm) [07:55:36] (03CR) 10Gergő Tisza: [C: 03+2] Add fields to analytics/mediawiki/mentor_dashboard/visit [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/901539 (https://phabricator.wikimedia.org/T325117) (owner: 10Urbanecm) [07:56:08] (03Merged) 10jenkins-bot: Add fields to analytics/mediawiki/mentor_dashboard/visit [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/901539 (https://phabricator.wikimedia.org/T325117) (owner: 10Urbanecm) [08:03:10] (SystemdUnitFailed) firing: (8) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:04:37] 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work): Migrate search_satisfaction.py from airflow 1 to airflow 2 - https://phabricator.wikimedia.org/T329880 (10Gehel) 05Open→03Resolved [08:04:40] 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work): [Tracking] Migrate Search Airflow jobs to Airflow 2 and use shared supporting 
code from the data engineering Airflow - https://phabricator.wikimedia.org/T318414 (10Gehel) [08:04:42] 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work): [Tracking] Migrate Search Airflow jobs to Airflow 2 and use shared supporting code from the data engineering Airflow - https://phabricator.wikimedia.org/T318414 (10Gehel) [08:04:44] 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work): Migrate mediawiki_revision_recommendation_create.py from airflow 1 to airflow 2 - https://phabricator.wikimedia.org/T330447 (10Gehel) 05Open→03Resolved [08:04:46] 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work): [Tracking] Migrate Search Airflow jobs to Airflow 2 and use shared supporting code from the data engineering Airflow - https://phabricator.wikimedia.org/T318414 (10Gehel) 05Open→03Resolved a:03Gehel [08:04:48] 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work): [Tracking] Migrate Search Airflow jobs to Airflow 2 and use shared supporting code from the data engineering Airflow - https://phabricator.wikimedia.org/T318414 (10Gehel) [08:04:50] 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work): Migrate transfer_to_es.py from airflow 1 to airflow 2 - https://phabricator.wikimedia.org/T329881 (10Gehel) 05Open→03Resolved [08:04:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:05:06] 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work): Migrate glent_weekly.py from airflow 1 to airflow 2 - https://phabricator.wikimedia.org/T329872 (10Gehel) 05Open→03Resolved [08:05:08] 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work): [Tracking] Migrate Search Airflow jobs to Airflow 2 and use shared supporting code from the 
data engineering Airflow - https://phabricator.wikimedia.org/T318414 (10Gehel) [08:08:15] 10Data-Engineering, 10CirrusSearch, 10Discovery-Search (Current work): Fix permissions in hdfs://analytics-hadoop/wmf/data/discovery - https://phabricator.wikimedia.org/T331580 (10Gehel) 05Open→03Resolved [08:13:47] 10Analytics-Radar, 10DC-Ops, 10SRE, 10SRE-swift-storage, 10ops-eqiad: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (10MatthewVernon) @Jclark-ctr I can shut down ms-be1042 for you (or you can DIY, there's no special procedure for this host). Can I confirm you want it shut dow... [08:15:46] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:18:10] (SystemdUnitFailed) firing: (8) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:38:36] 10Analytics-Radar, 10DC-Ops, 10SRE, 10SRE-swift-storage, 10ops-eqiad: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f76e48e4-3716-4c3a-8992-2858603cabe9) set by btullis@cumin1001 for 4 days, 0:00:00 on 1 host... 
[08:41:30] (03CR) 10Aqu: Migrate refine webrequest to Airflow (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/894661 (https://phabricator.wikimedia.org/T327073) (owner: 10Aqu) [08:44:06] !log Shutting down an-worker1091 for RAID battery replacement T332883 [08:44:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:44:09] T332883: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 [08:50:50] 10Analytics-Radar, 10DC-Ops, 10SRE, 10SRE-swift-storage, 10ops-eqiad: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (10BTullis) @Jclark-ctr I've shut down an-worker1091 so you can replace the battery at any time. Feel free to boot it when the work is finished, as it should re... [08:56:00] 10Analytics-Radar, 10DC-Ops, 10SRE, 10SRE-swift-storage, 10ops-eqiad: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (10Jclark-ctr) @matthewvernon 1300 utc will be on site to change battery [08:57:20] 10Analytics-Radar, 10DC-Ops, 10SRE, 10SRE-swift-storage, 10ops-eqiad: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (10MatthewVernon) >>! In T332883#8744637, @Jclark-ctr wrote: > @matthewvernon 1300 utc will be on site to change battery Ah, glad I checked! I'll have it shut... [09:05:11] hello folks! [09:05:21] I restarted kafka on kafka-jumbo1002 to move it to pki [09:05:34] I'll proceed slowly over the next days to restart brokers [09:05:45] so we can monitor if any client doesn't like it [09:07:11] 10Analytics-Radar, 10Data-Engineering-Radar, 10Event-Platform Value Stream, 10Patch-For-Review: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064 (10elukey) Status - all brokers are getting the new TLS certificates via puppet, I'll keep restarting one broker at th... [09:11:17] elukey: Hi! That's great, thank you. 
[09:12:21] Just to let you know, I'm going to be away on annual leave for the next two weeks. [09:13:03] btullis: ah nice have a good time off! I'll try not to set on fire Kafka while you are away :) [09:13:42] Much appreciated :-) steve_munene is going to be the primary SRE contact and firefighter in the Data Engineering team, while I'm away. [09:19:14] Hey elukey, that wireshark tutorial page you found is great! https://www.golinuxcloud.com/troubleshooting-tls-failures-wireshark/ I'm going to bookmark that. [09:19:34] Did you see the kerberos one too? https://www.golinuxcloud.com/kerberos-auth-packet-analysis-wireshark/ [09:21:04] ah wow no! I found it yesterday, I was amazed [09:21:17] tshark is so nice, I didn't know about the -Y parameter [09:25:24] steve_munene: o/ if you need help with code reviews etc.. I am available, ping me anytime :) [09:30:03] o/ we're looking into how to best build java projects in gitlab, do you have a java (or scala) project built from gitlab or are they all still in gerrit? [09:32:46] dcausse: We have this one: https://gitlab.wikimedia.org/repos/data-engineering/spark8s - It contains two sub-projects, one in Java and one in Go [09:33:46] That's the only Java job I know about that this team currently builds in GitLab, but I could be wrong. [09:38:53] btullis: thanks! 
will take a look [10:10:30] PROBLEM - Kafka Broker Server on kafka-test1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [10:10:56] PROBLEM - Check systemd state on kafka-test1006 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:10:59] ^looking [10:11:34] PROBLEM - Kafka broker TLS certificate validity on kafka-test1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [10:12:37] sorry folks I didn't think it would cause alerts, I was about to say in here that I was testing one thing [10:12:51] I thought kafka test was not firing alerts :( [10:13:09] I am trying to find a good prometheus metrics to show if a broker is down [10:13:14] to remove the nagios stuff [10:13:31] https://thanos.wikimedia.org/graph?g0.expr=count(kafka_server_KafkaServer_BrokerState%7Bkafka_cluster%3D%22test-eqiad%22%7D)&g0.tab=0&g0.stacked=0&g0.range_input=15m&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D seems promising [10:15:33] Yeah, I have a vague memory of looking at this a while ago, maybe with g.odog. Something like firing if we can work out if we're below quorum. 
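(Editor's sketch, not part of the channel discussion.) The Thanos link above carries a URL-encoded PromQL expression, and the idea floated is to alert when fewer brokers report `kafka_server_KafkaServer_BrokerState` than expected. A minimal Python sketch of that check follows. It decodes the expression from a shortened copy of the pasted URL and compares a reported broker count against an expected one. The function names, the `expected` broker count, and the `tolerated_missing` threshold are all hypothetical and do not come from the conversation; this is not the alert rule anyone actually deployed.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened copy of the Thanos link from the channel: only the g0.expr
# parameter (plus one other) is kept, the remaining parameters are omitted.
THANOS_URL = (
    "https://thanos.wikimedia.org/graph"
    "?g0.expr=count(kafka_server_KafkaServer_BrokerState"
    "%7Bkafka_cluster%3D%22test-eqiad%22%7D)&g0.tab=0"
)


def promql_from_graph_url(url: str) -> str:
    """Return the percent-decoded PromQL expression from the g0.expr parameter."""
    params = parse_qs(urlsplit(url).query)
    return params["g0.expr"][0]


def brokers_missing(reporting: int, expected: int) -> int:
    """Number of brokers not exporting the metric (never negative)."""
    return max(expected - reporting, 0)


def should_alert(reporting: int, expected: int, tolerated_missing: int = 0) -> bool:
    """Hypothetical rule: alert when more brokers are missing than tolerated."""
    return brokers_missing(reporting, expected) > tolerated_missing


if __name__ == "__main__":
    print(promql_from_graph_url(THANOS_URL))
    # 'expected=5' is an illustrative value, not the real cluster size.
    print(should_alert(reporting=4, expected=5))
```

One caveat if this were turned into a real rule: in PromQL, `count()` over a selector that matches no series returns an empty result rather than 0, so the case where every broker is down would likely need separate handling (for example with `absent()`).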
[10:18:10] RECOVERY - Check systemd state on kafka-test1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:18:50] RECOVERY - Kafka broker TLS certificate validity on kafka-test1006 is OK: SSL OK - Certificate kafka-test1006.eqiad.wmnet valid until 2024-01-13 11:02:00 +0000 (expires in 288 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [10:19:36] RECOVERY - Kafka Broker Server on kafka-test1006 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [10:39:42] 10Data-Engineering-Planning, 10DC-Ops, 10SRE, 10Shared-Data-Infrastructure, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jclark-ctr) [10:47:42] thanks elukey: Enjoy your time off btullis \o/ [10:57:41] 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Machine-Learning-Team: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (10achou) [11:03:17] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:03:24] (SystemdUnitFailed) firing: (8) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:07:45] (03PS11) 10Aqu: Migrate refine webrequest to Airflow [analytics/refinery] - 10https://gerrit.wikimedia.org/r/894661 (https://phabricator.wikimedia.org/T327073) [11:09:40] (03CR) 10Aqu: "1 notice" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/894661 (https://phabricator.wikimedia.org/T327073) (owner: 10Aqu) 
[11:09:47] PROBLEM - SSH on kafka-jumbo1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:11:07] RECOVERY - SSH on kafka-jumbo1001 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:16:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:18:29] (SystemdUnitFailed) firing: (8) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:35:06] 10Data-Engineering, 10Machine-Learning-Team, 10Research, 10Event-Platform Value Stream (Sprint 10): Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (10achou) > I guess this is more of the question: Do we want to ever be able to do this? If... [11:41:56] 10Analytics-Radar, 10DC-Ops, 10SRE, 10SRE-swift-storage, 10ops-eqiad: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e19efa89-db0e-4ad2-bcc9-ed867218f629) set by mvernon@cumin2002 for 1 day, 0:00:00 on 1 host(... [11:43:20] 10Analytics-Radar, 10DC-Ops, 10SRE, 10SRE-swift-storage, 10ops-eqiad: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (10MatthewVernon) @Jclark-ctr ms-be1042 shut down ready for you. 
[12:09:24] (03PS1) 10Aqu: Review Java UDFs used in refine webrequest [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/904778 (https://phabricator.wikimedia.org/T327072) [12:10:15] (03PS2) 10Aqu: Review Java UDFs used in refine webrequest [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/904778 (https://phabricator.wikimedia.org/T327072) [12:11:58] (03CR) 10CI reject: [V: 04-1] Review Java UDFs used in refine webrequest [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/904778 (https://phabricator.wikimedia.org/T327072) (owner: 10Aqu) [12:15:54] (03PS3) 10Aqu: Review Java UDFs used in refine webrequest [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/904778 (https://phabricator.wikimedia.org/T327072) [12:20:04] (03CR) 10CI reject: [V: 04-1] Review Java UDFs used in refine webrequest [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/904778 (https://phabricator.wikimedia.org/T327072) (owner: 10Aqu) [12:23:56] !log deploying datahub to staging T333580 [12:23:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:23:59] T333580: The staging and production deployments of datahub share an Opensearch cluster - https://phabricator.wikimedia.org/T333580 [12:39:16] 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Machine-Learning-Team: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (10Ottomata) [12:45:53] 10Data-Engineering, 10Machine-Learning-Team, 10Research, 10Event-Platform Value Stream (Sprint 10): Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (10Ottomata) > Updating the schema later to do this will not be easy, as it would be an inc... [12:47:07] (03CR) 10Gehel: "A few minor comments inline." 
(035 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/904778 (https://phabricator.wikimedia.org/T327072) (owner: 10Aqu) [13:23:25] 10Analytics-Radar, 10DC-Ops, 10SRE, 10SRE-swift-storage, 10ops-eqiad: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (10Jclark-ctr) ms-be1042 is finished @MatthewVernon [13:31:07] 10Data-Engineering, 10Machine-Learning-Team, 10Research, 10Event-Platform Value Stream (Sprint 10): Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (10achou) > what are the user use cases for having multiple classifications / embedding pre... [13:32:59] 10Data-Engineering, 10Data-Persistence, 10IP Masking: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (10JayCano) @Milimetric We created https://phabricator.wikimedia.org/T332420 to talk about how to identify temp users but I failed to notify everyone involved. Given that bo... [13:33:15] 10Analytics-Radar, 10DC-Ops, 10SRE, 10SRE-swift-storage, 10ops-eqiad: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (10Jclark-ctr) 05Open→03Resolved an-worker1091 @btullis Thanks for shutting down server Battery has been replaced [13:41:32] joal: hi! :] I managed to make the netflow jobs pipeline pass the tests (by moving the plugins to wmf_airflow_common), whenever you have time, I'd appreciate the final review :-) Thanks a lot! 
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/217 [13:41:45] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure, 10Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10JArguello-WMF) [13:41:48] 10Data-Engineering-Planning, 10Patch-For-Review, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10JArguello-WMF) [13:41:55] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Deploy ceph mon and mgr processes to data-engineering cluster - https://phabricator.wikimedia.org/T330149 (10JArguello-WMF) 05Open→03Resolved [13:41:59] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure, 10Epic: Install Ceph Cluster for Data Engineering - https://phabricator.wikimedia.org/T324660 (10JArguello-WMF) [13:42:03] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Investigate trend of gradual hive server heap exhaustion - https://phabricator.wikimedia.org/T303168 (10JArguello-WMF) 05Open→03Resolved [13:42:53] oh, joal, also pushed this puppet change to modify airflow config to account for the new plugins location, Andrew helped me with that and said we can deploy that whenever: https://gerrit.wikimedia.org/r/c/operations/puppet/+/904609 [13:43:12] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10Ladsgroup) [13:51:44] 10Data-Engineering-Planning, 10DC-Ops, 10SRE, 10Shared-Data-Infrastructure, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jclark-ctr) [13:54:52] 10Data-Engineering-Planning, 10Product-Analytics, 10Data Pipelines (sprint 10): 13 new wikis missing from mediawiki_history - https://phabricator.wikimedia.org/T329119 
(10JArguello-WMF) 05Open→03Resolved [13:54:54] 10Data-Engineering, 10Data Pipelines: Airflow Hackathon (May 2022) - https://phabricator.wikimedia.org/T307500 (10JArguello-WMF) [14:00:13] 10Data-Engineering-Planning, 10Product-Analytics, 10Data Pipelines (Sprint 11): 2 additional new wikis - https://phabricator.wikimedia.org/T332070 (10JArguello-WMF) [14:03:08] 10Data-Engineering-Planning, 10DC-Ops, 10SRE, 10Shared-Data-Infrastructure, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jclark-ctr) [14:35:52] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 10): [SPIKE] tune memory and latency of mediawiki-event-enrichment on k8s - https://phabricator.wikimedia.org/T332166 (10JArguello-WMF) 05Open→03Resolved [14:35:58] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 10): EventStreamCatalog removes 'topic' table option if connector = upsert-kafka - https://phabricator.wikimedia.org/T330769 (10JArguello-WMF) 05Open→03Resolved [14:36:00] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 10), 10Patch-For-Review: mediawiki-event-enrichment should support the latest eventutilities-python changes - https://phabricator.wikimedia.org/T330994 (10JArguello-WMF) 05Open→03Resolved [14:36:02] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 10): Flink EventStreamCatalog should not prevent creation of VIEWs - https://phabricator.wikimedia.org/T330703 (10JArguello-WMF) 05Open→03Resolved [14:36:04] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 10): Flink EventStreamCatalog should add watermark - https://phabricator.wikimedia.org/T330441 (10JArguello-WMF) 05Open→03Resolved [14:38:51] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 11): Improve mediawiki-event-enrichment test suite - https://phabricator.wikimedia.org/T328013 (10JArguello-WMF) [14:38:55] 10Data-Engineering, 10serviceops-radar, 10Event-Platform Value Stream (Sprint 11): 
Store Flink HA metadata in Zookeeper for mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T331283 (10JArguello-WMF) [14:38:58] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 11): Document Flink job deployment to k8s - https://phabricator.wikimedia.org/T329629 (10JArguello-WMF) [14:39:00] 10Data-Engineering-Planning, 10SRE-swift-storage, 10Event-Platform Value Stream (Sprint 11): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10JArguello-WMF) [14:39:03] 10Data-Engineering, 10Machine-Learning-Team, 10Research, 10Event-Platform Value Stream (Sprint 11): Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (10JArguello-WMF) [14:40:05] 10Data-Engineering-Planning, 10Data Pipelines (Sprint 11): Flink Enrichment monitoring - https://phabricator.wikimedia.org/T328925 (10JArguello-WMF) [14:40:12] 10Data-Engineering, 10Data Pipelines (Sprint 11): eventutilities-python: issue async requests from MapFunction context - https://phabricator.wikimedia.org/T332948 (10JArguello-WMF) [14:41:08] 10Data-Engineering, 10serviceops, 10Data Pipelines (Sprint 11), 10Epic, 10Patch-For-Review: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (10JArguello-WMF) [14:41:10] 10Data-Engineering-Planning, 10serviceops, 10Data Pipelines (Sprint 11), 10Patch-For-Review, 10Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (10JArguello-WMF) [14:42:52] 10Data-Engineering, 10serviceops, 10Epic, 10Event-Platform Value Stream (Sprint 11), 10Patch-For-Review: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (10JArguello-WMF) [14:42:59] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 11): eventutilities-python: issue async requests from 
MapFunction context - https://phabricator.wikimedia.org/T332948 (10JArguello-WMF) [14:43:09] 10Data-Engineering-Planning, 10serviceops, 10Event-Platform Value Stream (Sprint 11), 10Patch-For-Review, 10Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (10JArguello-WMF) [14:43:18] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 11): Flink Enrichment monitoring - https://phabricator.wikimedia.org/T328925 (10JArguello-WMF) [14:46:38] folks I moved up to kafka-jumbo1004 to pki [14:46:43] will do the rest of the brokers next wek [14:46:45] *week [14:46:49] so far all good [14:48:43] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 11), 10Patch-For-Review: Streaming services errors should be routed to an error event topic. - https://phabricator.wikimedia.org/T326536 (10JArguello-WMF) [15:15:43] 10Data-Engineering, 10Machine-Learning-Team, 10Research, 10Event-Platform Value Stream (Sprint 11): Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (10diego) > I mentioned embeddings + classifications because embeddings usually serve as t... [15:19:04] (SystemdUnitFailed) firing: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:42:32] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10BTullis) [16:35:04] elukey: thank you for the TLS work! 
[16:36:27] 10Data-Engineering, 10Machine-Learning-Team, 10Research, 10Event-Platform Value Stream (Sprint 11): Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (10Ottomata) > I imagine it may be useful to have them in the same event stream. We coul... [16:41:44] 10Data-Engineering-Planning, 10SRE-swift-storage, 10Event-Platform Value Stream (Sprint 11): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Ladsgroup) Hi, sorry, I just came back from ooo. I want to take a step back a... [16:49:20] OK folks. I'm about to sign off for the day. I'll be on leave for the next two weeks, so I'll be back at my desk on Monday 17th of April. [16:50:45] My phone numbers are on officewiki and stuff, if you need me for anything though. [16:53:56] 10Data-Engineering, 10Machine-Learning-Team, 10Research, 10Event-Platform Value Stream (Sprint 11): Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (10diego) >>! In T331401#8746316, @Ottomata wrote: >> I imagine it may be useful to have t... [16:57:03] 10Data-Engineering-Planning, 10Release-Engineering-Team, 10GitLab (CI & Job Runners), 10Performance Issue: Improve speed of Gitlab CI - https://phabricator.wikimedia.org/T311111 (10xcollazo) @lbowmaker for your consideration. We could win significant developer productivity by tackling this one. [16:58:24] 10Data-Engineering, 10Machine-Learning-Team, 10Research, 10Event-Platform Value Stream (Sprint 11): Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (10Ottomata) > having them in the same stream Just to be clear! 'same event' 'same stream... 
[17:08:40] 10Data-Engineering, 10SRE, 10ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10wiki_willy) @Cmjohnson / @Jclark-ctr - maybe we can try upgrading the firmware first if it's outdated? Thanks, Willy [17:20:44] 10Data-Engineering, 10Machine-Learning-Team, 10Research, 10Event-Platform Value Stream (Sprint 11): Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (10diego) Yes, I was thinking on the same event. Like: ` scores: model_name: exam... [18:19:56] 10Analytics, 10Data-Engineering, 10Data-Engineering-Wikistats: "Active editors by country" doesn't display numbers for Belarus, Kazakhstan, Russia - https://phabricator.wikimedia.org/T333716 (10Pavel1314) [18:47:17] 10Data-Engineering, 10Data-Engineering-Wikistats: A way to see pageviews statistics in regional level - https://phabricator.wikimedia.org/T333718 (10MusikAnimal) I'm guessing this refers to #data-engineering-wikistats, while {T333677} refers to #tool-pageviews. [18:53:50] 10Data-Engineering, 10AQS 2.0 Roadmap, 10API Platform (API Platform Roadmap), 10Epic, and 2 others: AQS 2.0: Media Analytics Service - https://phabricator.wikimedia.org/T288303 (10VirginiaPoundstone) {T321702} contains an inspiring use case for the GLAM community. Might require some bespoke data tabling an... 
[19:02:38] 10Data-Engineering, 10Data-Engineering-Wikistats: A way to see pageviews statistics on a geographical regional level - https://phabricator.wikimedia.org/T333718 (10Aklapper) [19:19:50] (SystemdUnitFailed) firing: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:27:26] 10Data-Engineering-Planning, 10DC-Ops, 10SRE, 10Shared-Data-Infrastructure, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jclark-ctr) [20:00:24] 10Data-Engineering, 10Data-Engineering-Wikistats: A way to see pageviews statistics on a geographical regional level - https://phabricator.wikimedia.org/T333718 (10Nikosgranturismogt) [22:09:42] 10Data-Engineering, 10SRE, 10ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10Jclark-ctr) opened Dell ticket. sent support assist Confirmed: Service Request 165406278 was successfully submitted. 
[23:23:46] (SystemdUnitFailed) firing: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:28:05] 10Data-Engineering, 10Observability-Logging: eventgate-analytics-external logs field explosion - https://phabricator.wikimedia.org/T333729 (10colewhite) [23:29:17] 10Data-Engineering, 10Observability-Logging: eventgate-analytics-external logs field explosion - https://phabricator.wikimedia.org/T333729 (10colewhite) [23:52:34] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10colewhite)