[06:35:38] Data-Engineering, Event-Platform Value Stream (Sprint 10): eventutilities-python: issue async requests from MapFunction context - https://phabricator.wikimedia.org/T332948 (gmodena) I have been able to validate that the idea can work. I was able to init a thread pool local to an operator and execute pa...
[06:38:10] Data-Engineering, Event-Platform Value Stream (Sprint 10): eventutilities-python: issue async requests from MapFunction context - https://phabricator.wikimedia.org/T332948 (gmodena)
[06:38:34] Data-Engineering, Event-Platform Value Stream (Sprint 10): eventutilities-python: issue async requests from MapFunction context - https://phabricator.wikimedia.org/T332948 (gmodena)
[06:40:55] Data-Engineering, AQS2.0, API Platform (Sprint 06), Platform Engineering Roadmap, User-Eevans: AQS 2.0: Pageviews: Implement Unit Tests - https://phabricator.wikimedia.org/T299735 (SGupta-WMF) Verified with the check-list, the test suite is complete and working fine. Thanks!
[06:41:11] Data-Engineering, AQS2.0, API Platform (Sprint 06), Platform Engineering Roadmap, User-Eevans: AQS 2.0: Pageviews: Implement Unit Tests - https://phabricator.wikimedia.org/T299735 (SGupta-WMF) Open→Resolved
[06:41:13] Data-Engineering, AQS 2.0 Roadmap, API Platform (API Platform Roadmap), Epic, and 2 others: AQS 2.0: Page Analytics Service - https://phabricator.wikimedia.org/T288296 (SGupta-WMF)
[08:07:30] Data-Engineering-Planning, DBA, Data Pipelines, Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Marostegui)
[08:16:38] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (Marostegui)
[08:20:59] Data-Engineering-Planning, DBA, Data Pipelines, Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Marostegui)
[08:33:15] hello folks!
[08:33:26] if you are ok I'd roll out https://gerrit.wikimedia.org/r/c/operations/puppet/+/901549
[08:33:34] requires a roll restart of the jumbo brokers
[08:43:19] Data-Engineering, Event-Platform Value Stream, MediaWiki-extensions-WikimediaEvents, Product-Analytics, and 3 others: Decommission the EditorActivation instrument - https://phabricator.wikimedia.org/T330766 (phuedx)
[08:43:52] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (fgiunchedi)
[08:57:00] Data-Engineering-Planning, DBA, Data Pipelines, Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (fgiunchedi)
[08:57:55] Data-Engineering, Event-Platform Value Stream, MediaWiki-extensions-WikimediaEvents, Product-Analytics, and 3 others: Decommission the EditorActivation instrument - https://phabricator.wikimedia.org/T330766 (phuedx)
[08:58:29] (started anyway, first broker restarted, all good afaics)
[09:03:34] Data-Engineering, Event-Platform Value Stream, MediaWiki-extensions-WikimediaEvents, Product-Analytics, and 3 others: Decommission the EditorActivation instrument - https://phabricator.wikimedia.org/T330766 (phuedx)
[09:03:59] Analytics-Kanban, Data-Engineering, Event-Platform Value Stream, Fundraising-Backlog, and 3 others: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned - https://phabricator.wikimedia.org/T282131 (phuedx)
[09:19:27] (the cookbook will likely take 2-3 hours to complete, but so far I don't see issues)
[09:19:59] the end result will be that all brokers will be able to accept certs (from clients and other brokers) emitted by the Puppet CA or PKI
[09:20:16] so I am not changing any cert now, only the trust stores
[09:20:27] it is the pre-requisite to be able to swap the broker's certs
[10:02:08] Data-Engineering-Planning, DBA, Data Pipelines, Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (hnowlan)
[10:02:28] Data-Engineering-Planning, DBA, Data Pipelines, Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ArielGlenn)
[10:03:11] Data-Engineering-Planning, DBA, Data Pipelines, Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (hnowlan)
[10:12:40] btullis, steve_munene - around?
[10:31:00] elukey: Yes, I'm here.
[10:31:59] hello :)
[10:32:00] Sorry I didn't ack the above. All fine by me though.
[10:32:34] nono all good :) I'd need to step afk for lunch, I am comfortable in letting the cookbook run since we have only 3 nodes left and the rest are working perfectly well
[10:32:50] but I wanted to double check with you and give you a heads up just in case
[10:33:21] Yep, gotcha. I'm around for the next few hours anyway, in case of any events.
[10:33:26] super thanks :)
[10:33:39] after this we'll be ready to flip the first broker to pki :)
[10:34:00] I just sent an email about Hadoop HDFS/YARN Superset, Hive, Druid downtime tomorrow related to the switch upgrade.
[10:39:11] yes yes didn't mean tomorrow, anytime in the future
[10:40:00] Ah yes, sorry for being vague. I didn't mean that these things were linked either. Just an FYI :-)
[10:40:34] ahhh okok :D
[10:40:36] I saw the email :)
[10:40:59] going afk for lunch, ttl!
[10:42:17] ack: Enjoy!
[11:24:38] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (Jelto)
[11:34:10] Data-Engineering-Planning, DBA, Data Pipelines, Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Jelto)
[11:56:43] Analytics, Data-Engineering-Icebox, CX-analytics, Language-analytics, Technical-Debt: Special:ContentTranslationStats is slow and getting crowded - https://phabricator.wikimedia.org/T325790 (Nikerabbit)
[12:48:31] (SystemdUnitFailed) firing: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:59:06] Analytics-Radar, Data-Engineering-Radar, Event-Platform Value Stream: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064 (elukey) The cluster is now running with the extended trust store (containing both Puppet and PKI's root CA certs). Next steps: - Move k...
[13:24:46] Analytics-Radar, Data-Engineering-Radar, Event-Platform Value Stream, Patch-For-Review: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064 (Jgreen) >>! In T296064#8728816, @elukey wrote: > The cluster is now running with the extended trust store (containi...
[13:29:53] btullis: We'd like to depool clouddb[1015-1016] for some maintenance tomorrow. Do you know off the top of your head how to move traffic off of them? (If not I can just figure it out)
[13:30:07] context is https://etherpad.wikimedia.org/p/wmcs-vs-rowb-upgrade
[14:11:51] (PS1) Milimetric: test [analytics/refinery] - https://gerrit.wikimedia.org/r/903261
[14:13:08] (PS1) Btullis: Tweak the build process and and fix local container builds [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/903262 (https://phabricator.wikimedia.org/T303381)
[14:13:45] (CR) Milimetric: test (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/903261 (owner: Milimetric)
[14:15:16] (PS2) Milimetric: test [analytics/refinery] - https://gerrit.wikimedia.org/r/903261
[14:20:43] (Abandoned) Milimetric: test [analytics/refinery] - https://gerrit.wikimedia.org/r/903261 (owner: Milimetric)
[14:30:10] Data-Engineering, Machine-Learning-Team, Research, Event-Platform Value Stream (Sprint 10): Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (Ottomata) > If this is the case, we can update the schema version accordingly at that ti...
[14:40:58] joal: I'm trying to track down a problem and can't remember if you saw it last week
[14:41:11] mediawiki.page-move doesn't have data for March 14, hours 11 and 12
[14:41:23] so like /wmf/data/event/mediawiki_page_move/datacenter\=codfw/year\=2023/month\=3/day\=14/hour\=11 is missing
[14:41:55] but also the raw data is missing there, /wmf/data/raw/event/codfw.mediawiki.page-move/year\=2023/month\=03/day\=14/hour\=11
[14:42:25] and I remember something about adding success flags manually to something... but maybe that was a different dataset?
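For reference, the two partition directories quoted at 14:41 differ in layout: the refined path carries a datacenter= partition and month=3, while the raw path puts the datacenter in the topic directory name and uses month=03. A minimal Python sketch of building both for a given hour follows. The function names are hypothetical, the shell escaping (\=) seen in the chat is dropped, and zero-padding of day and hour in the raw layout is an assumption, since the quoted example (day=14, hour=11) does not show it either way.

```python
from datetime import datetime

# Base directories copied from the paths quoted in the chat.
REFINED_BASE = "/wmf/data/event/mediawiki_page_move"
RAW_BASE = "/wmf/data/raw/event"


def refined_path(datacenter: str, hour: datetime) -> str:
    """Refined layout: datacenter= partition, unpadded month/day/hour."""
    return (
        f"{REFINED_BASE}/datacenter={datacenter}"
        f"/year={hour.year}/month={hour.month}"
        f"/day={hour.day}/hour={hour.hour}"
    )


def raw_path(datacenter: str, hour: datetime) -> str:
    """Raw layout: datacenter-prefixed topic directory, zero-padded month.

    Padding of day and hour is an assumption (see the note above).
    """
    return (
        f"{RAW_BASE}/{datacenter}.mediawiki.page-move"
        f"/year={hour.year}/month={hour.month:02d}"
        f"/day={hour.day:02d}/hour={hour.hour:02d}"
    )


if __name__ == "__main__":
    # The two hours reported missing in the chat.
    for h in (11, 12):
        ts = datetime(2023, 3, 14, h)
        print(refined_path("codfw", ts))
        print(raw_path("codfw", ts))
```

Listing both paths side by side for each hour makes it easier to tell whether only the refined output is missing or the raw import is missing too, which is the distinction drawn in the chat.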
[14:46:01] (CR) CI reject: [V: -1] Tweak the build process and and fix local container builds [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/903262 (https://phabricator.wikimedia.org/T303381) (owner: Btullis)
[15:08:55] Data-Engineering-Planning, serviceops, Event-Platform Value Stream (Sprint 10), Patch-For-Review, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (Ottomata) > why is kafka-main a better fit than kafka-jumbo?...
[15:12:38] Data-Engineering-Planning, serviceops, Event-Platform Value Stream (Sprint 10), Patch-For-Review, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (Ottomata) Also, clearly we will not be ready to deploy this...
[15:26:19] Data-Engineering-Planning, SRE-swift-storage, Event-Platform Value Stream (Sprint 10): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (Ottomata) > usefulness of cross-DC replication After asking @dcausse, I unde...
[16:00:18] Data-Engineering-Planning, Data Pipelines: Assign Superset sql_labs access through customer roles - https://phabricator.wikimedia.org/T331160 (Ottomata)
[16:00:24] Data-Engineering-Planning, Data Pipelines: Review Superset permissions and assign roles as appropriate - https://phabricator.wikimedia.org/T328457 (Ottomata)
[16:00:51] Data-Engineering, Superset: Create a custom Superset Role to allow for non default permissioning - https://phabricator.wikimedia.org/T298714 (Ottomata)
[16:01:00] Data-Engineering-Planning, Data Pipelines: Review Superset permissions and assign roles as appropriate - https://phabricator.wikimedia.org/T328457 (Ottomata)
[16:02:09] Data-Engineering-Planning, Data Pipelines: Review Superset permissions and assign roles as appropriate - https://phabricator.wikimedia.org/T328457 (Ottomata) > Decide who should be granted access to SQL Lab in Superset IMO, let's just default granting sql_lab access to all Superset accounts. Data acce...
[16:06:26] Wow I missed your ping milimetric - please excuse me
[16:06:35] Let's talk about this error PS
[16:36:32] ottomata: Heya - Would you have a minute for a kafka question?
[16:49:35] (SystemdUnitFailed) firing: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:03:47] Data-Engineering, Advanced-Search, All-and-every-Wikisource, ArticlePlaceholder, and 69 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (Jdlrobson)
[17:19:23] !log added 2023-03-14T11 and 2023-03-14T12 partitions for codfw on event.mediawiki_page_move with alter table mediawiki_page_move add partition (datacenter='codfw',year=2023,month=3,day=14,hour=[11,12]);
[17:19:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:19:33] (without the square bracket thing... sorry)
[17:45:36] Data-Engineering, Product-Analytics: Pilot ETL migration - https://phabricator.wikimedia.org/T333208 (mpopov)
[17:47:54] Data-Engineering, Product-Analytics (Kanban): Pilot ETL migration - https://phabricator.wikimedia.org/T333208 (mpopov) p:Triage→High Assigning myself to coordinate selecting candidate ETL jobs. Then will pass it to DE for the initial migration once we're aligned on the work that will need to be d...
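The 17:19 !log entry above uses hour=[11,12] as shorthand, and the follow-up message notes that the brackets were not part of what was actually run. A minimal Python sketch of expanding that shorthand into one statement per hour is given here. The helper name is hypothetical, and the statement text only mirrors the one logged in the chat; it is not a record of the exact commands that were executed.

```python
from typing import Iterable, List


def add_partition_statements(
    table: str,
    datacenter: str,
    year: int,
    month: int,
    day: int,
    hours: Iterable[int],
) -> List[str]:
    """Build one ALTER TABLE ... ADD PARTITION statement per hour.

    The partition spec follows the !log message: datacenter is a quoted
    string, the date parts are unquoted integers.
    """
    return [
        f"alter table {table} add partition "
        f"(datacenter='{datacenter}',year={year},month={month},"
        f"day={day},hour={hour});"
        for hour in hours
    ]


if __name__ == "__main__":
    # The two hours named in the !log entry.
    for stmt in add_partition_statements(
        "mediawiki_page_move", "codfw", 2023, 3, 14, [11, 12]
    ):
        print(stmt)
```

Generating the statements rather than typing them by hand keeps the partition spec identical across hours, so the only thing that varies is the hour value.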
[17:48:06] Data-Engineering, Product-Analytics (Kanban): Pilot ETL migration - https://phabricator.wikimedia.org/T333208 (mpopov) p:High→Medium
[17:49:23] Data-Engineering, Product-Analytics (Kanban): Pilot ETL migration - https://phabricator.wikimedia.org/T333208 (mpopov) a:mpopov
[19:00:42] Data-Engineering, Product-Analytics (Kanban): Product Analytics ETL Migration: Pilot (MediaSearch ETLs) - https://phabricator.wikimedia.org/T333208 (mpopov)
[19:37:42] Data-Engineering-Planning, Data Pipelines: Review Superset permissions and assign roles as appropriate - https://phabricator.wikimedia.org/T328457 (mpopov) +1 to defaulting to all, plus it can be useful to run sql queries against druid data cubes and it makes sense that everyone with superset access be a...
[20:07:12] Data-Engineering, Data-Persistence, IP Masking: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (Milimetric)
[20:21:13] Data-Engineering-Planning, Data Pipelines: Setup config to allow lineage instrumentation - https://phabricator.wikimedia.org/T333004 (Milimetric)
[20:40:59] Data-Engineering, Event-Platform Value Stream (Sprint 10): [SPIKE] tune memory and latency of mediawiki-event-enrichment on k8s - https://phabricator.wikimedia.org/T332166 (Ottomata) Perhaps it would be easier to isolate the memory leak(?) using a profile in a local Flink 'minicluster' instead of in YARN...
[20:51:20] (SystemdUnitFailed) firing: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:42:03] Data-Engineering, Event-Platform Value Stream (Sprint 10): [SPIKE] tune memory and latency of mediawiki-event-enrichment on k8s - https://phabricator.wikimedia.org/T332166 (Ottomata) Some things I'm noticing: There are more errors like this than I'd expect: ` 2023-03-27 21:23:17,408 WARN /home/otto/.co...
[22:09:48] Data-Engineering-Planning, DBA, Data Pipelines, Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Volans) >>! In T330165#8731601, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.t...
[22:22:37] Data-Engineering, Event-Platform Value Stream (Sprint 10): [SPIKE] tune memory and latency of mediawiki-event-enrichment on k8s - https://phabricator.wikimedia.org/T332166 (Ottomata) You can see the stuttering up in [[ https://grafana.wikimedia.org/goto/TeordLf4z?orgId=1 | this panel ]]
[22:50:12] Data-Engineering, Data-Persistence, IP Masking: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (Ladsgroup) Yup, adding a new column is not that hard. There are some documentation on this https://wikitech.wikimedia.org/wiki/Schema_changes and https://www.mediawiki.or...
[23:17:27] Data-Engineering-Planning, DBA, Data Pipelines, Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (colewhite)
[23:21:49] Data-Engineering, SRE, ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (wiki_willy) a:Jclark-ctr
[23:22:35] Analytics-Radar, DC-Ops, SRE, SRE-swift-storage, ops-eqiad: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (wiki_willy) a:Jclark-ctr