[02:31:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [04:31:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [04:42:13] 10Data-Engineering: stat1008's /srv partition is getting full due to home dirs - https://phabricator.wikimedia.org/T337246 (10santhosh) Cleaned up some files. 295GB-> 212GB. I am in middle of a project that require wiki dumps processing. I should be able to remove all these once it is done. sorry for inconvenien... [07:37:58] 10Data-Engineering, 10SRE, 10Shared-Data-Infrastructure, 10serviceops: kafka_mirror_maker TLS cert about to expire - 2023 - https://phabricator.wikimedia.org/T337248 (10elukey) @jbond thanks for checking! I think that the main question mark is what a client cert for kafka mirror maker (and potentially also... [07:41:06] 10Data-Engineering, 10SRE, 10Shared-Data-Infrastructure, 10serviceops: kafka_mirror_maker TLS cert about to expire - 2023 - https://phabricator.wikimedia.org/T337248 (10JMeybohm) >>! In T337248#8875545, @elukey wrote: > @jbond thanks for checking! I think that the main question mark is what a client cert f... 
[08:08:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [08:18:26] 10Data-Engineering: stat1008's /srv partition is getting full due to home dirs - https://phabricator.wikimedia.org/T337246 (10elukey) 05Open→03Resolved a:03elukey ` elukey@stat1008:~$ df -h | grep srv /dev/mapper/vg0-srv 7.2T 3.3T 3.5T 49% /srv ` We are in good shape now! Thanks all! [08:51:13] Hello, btullis, on the test cluster, some Airflow jobs are failing when run on an-test-worker1001 with: file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar does not exist [08:51:13] 23/05/24 08:32:36 ERROR SessionState: file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar does not exist [08:51:56] Eventually, after a lot of retries, they pass because they end up running on an-test-worker1002.eqiad.wmnet [09:01:54] 10Data-Engineering-Planning, 10Data Pipelines, 10Event-Platform Value Stream: Event partitions missing since 2023-02-21T10:00 for stream without events (canary events not produced?) - https://phabricator.wikimedia.org/T330236 (10Antoine_Quhen) Refinery-source does not ship Scala anymore because it was inclu... 
[09:03:25] 10Data-Engineering-Planning, 10Data-Platform-SRE, 10Shared-Data-Infrastructure (Q4 Wrap up): Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10BTullis) [09:04:02] 10Data-Engineering-Planning, 10Data-Platform-SRE, 10Shared-Data-Infrastructure (Q4 Wrap up): Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10BTullis) a:03Stevemunene [09:10:34] 10Data-Engineering, 10SRE, 10Shared-Data-Infrastructure, 10serviceops, 10Patch-For-Review: kafka_mirror_maker TLS cert about to expire - 2023 - https://phabricator.wikimedia.org/T337248 (10jbond) > We can create multiple certs with the same CN on different machines (or even on the same machine). Thats us... [09:14:30] (03CR) 10Gergő Tisza: Add analytics/mediawiki/mentor_dashboard/interaction (034 comments) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/919236 (https://phabricator.wikimedia.org/T325117) (owner: 10Urbanecm) [09:58:25] 10Data-Engineering-Planning, 10SRE-swift-storage, 10Event-Platform Value Stream (Sprint 14 A): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10gmodena) Hopping on this thread to confirm that we are now able to store sn... [10:12:23] aqu: Oh, interesting. Thanks for the ping. I'll put a note on: T329363 and look into how that jar should be deployed. [10:12:24] T329363: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 [10:21:57] 10Data-Engineering-Planning, 10Data-Platform-SRE, 10Patch-For-Review, 10Shared-Data-Infrastructure (Q4 Wrap up): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10BTullis) There is an issue with an-test-client1001 now that we have re-enabled yarn. @Antoine_Quhen brought... 
[10:34:32] 10Data-Engineering, 10AQS2.0, 10API Platform (AQS 2.0 Roadmap), 10Epic, and 2 others: AQS 2.0: Page Analytics Service - https://phabricator.wikimedia.org/T288296 (10Atieno) a:05BPirkle→03Atieno [10:35:33] 10Data-Engineering, 10AQS2.0, 10API Platform (AQS 2.0 Roadmap), 10Epic, and 2 others: AQS 2.0: Page Analytics Service - https://phabricator.wikimedia.org/T288296 (10Atieno) [10:38:30] 10Data-Engineering-Planning, 10Data-Platform-SRE, 10Patch-For-Review, 10Shared-Data-Infrastructure (Q4 Wrap up): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10BTullis) Interesting! The `hive-hcatalog` package is installed on both the bullseye worker and the buster w... [10:42:03] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A): Define Service Level Objective (SLO) for mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T333833 (10gmodena) There's a draft at https://docs.google.com/document/d/1U2bYVqmEsn7ryP0dtFUr-S5xPqF9_plLIFdzk883HBc/... [10:43:07] aqu: I've added some notes to https://phabricator.wikimedia.org/T329363#8876220 - Do you think we should exclude an-test-worker1001 from yarn until this can be fixed? [10:55:17] (03CR) 10Nikerabbit: Show a warning for unused languages with localization over 75% (031 comment) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/921329 (https://phabricator.wikimedia.org/T336752) (owner: 10Nik Gkountas) [11:35:34] 10Data-Engineering, 10MediaWiki-Vagrant, 10MediaWiki-extensions-EventLogging: Interface 'Wikimedia\MetricsPlatform\EventSubmitter' not found - https://phabricator.wikimedia.org/T337383 (10Tgr) [11:38:00] 10Data-Engineering-Planning, 10Data-Platform-SRE, 10Patch-For-Review, 10Shared-Data-Infrastructure (Q4 Wrap up): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10BTullis) I'm doing a couple of experimental builds of hive packages for buster and bullseye, from [[https:/... 
[11:47:05] 10Data-Engineering-Planning, 10Data-Platform-SRE, 10Patch-For-Review, 10Shared-Data-Infrastructure (Q4 Wrap up): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10BTullis) I have found the script in bigtop which was supposed to have been responsible for creating the sym... [12:01:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 90% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [12:21:31] !log rerun failed druid_load_pageviews_hourly_aggregated_daily [12:21:32] Schedule: @daily info Next Run: 2023-05-24, 00:00:00 [12:21:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:21:51] !log rerun failed druid_load_pageviews_hourly_aggregated_daily 2023-05-17 [12:21:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:54:23] (03CR) 10Gergő Tisza: [C: 03+2] Add analytics/mediawiki/mentor_dashboard/interaction [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/919236 (https://phabricator.wikimedia.org/T325117) (owner: 10Urbanecm) [12:55:11] (03Merged) 10jenkins-bot: Add analytics/mediawiki/mentor_dashboard/interaction [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/919236 (https://phabricator.wikimedia.org/T325117) (owner: 10Urbanecm) [13:07:00] (03PS3) 10Jbond: udplog: /etc/udp2log should be a folder not a file [analytics/udplog] - 10https://gerrit.wikimedia.org/r/922573 (https://phabricator.wikimedia.org/T276622) [13:11:44] (03PS4) 10Jbond: udplog: /etc/udp2log should be a folder not a file [analytics/udplog] - 10https://gerrit.wikimedia.org/r/922573 (https://phabricator.wikimedia.org/T276622) [13:12:48] (03CR) 10Jbond: "@filippo this has already been reviewed by 
moritz" [analytics/udplog] - 10https://gerrit.wikimedia.org/r/673596 (https://phabricator.wikimedia.org/T276623) (owner: 10Majavah) [13:17:43] (03PS5) 10Jbond: udplog: /etc/udp2log should be a folder not a file [analytics/udplog] - 10https://gerrit.wikimedia.org/r/922573 (https://phabricator.wikimedia.org/T276622) [13:24:38] (03CR) 10Filippo Giunchedi: "LGTM" [analytics/udplog] - 10https://gerrit.wikimedia.org/r/922573 (https://phabricator.wikimedia.org/T276622) (owner: 10Jbond) [13:24:57] (03CR) 10Filippo Giunchedi: "Ack! (LGTM)" [analytics/udplog] - 10https://gerrit.wikimedia.org/r/673596 (https://phabricator.wikimedia.org/T276623) (owner: 10Majavah) [13:27:28] 10Data-Engineering, 10Event-Platform Value Stream: Remove user is_registered field from mediawiki/page/change schema - https://phabricator.wikimedia.org/T337395 (10Ottomata) [13:28:47] 10Data-Engineering, 10Event-Platform Value Stream: Remove user is_registered field from mediawiki/page/change schema - https://phabricator.wikimedia.org/T337395 (10Ottomata) [13:35:58] (03PS3) 10Jbond: Switch to systemd [analytics/udplog] - 10https://gerrit.wikimedia.org/r/673596 (https://phabricator.wikimedia.org/T276623) (owner: 10Majavah) [13:36:11] (03PS6) 10Jbond: udplog: /etc/udp2log should be a folder not a file [analytics/udplog] - 10https://gerrit.wikimedia.org/r/922573 (https://phabricator.wikimedia.org/T276622) [13:36:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 90% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [13:37:35] (03PS7) 10Jbond: udplog: /etc/udp2log should be a folder not a file [analytics/udplog] - 10https://gerrit.wikimedia.org/r/922573 (https://phabricator.wikimedia.org/T276622) [13:38:27] (HiveServerHeapUsage) firing: Hive 
Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [13:39:08] 10Data-Engineering, 10Event-Platform Value Stream: Remove user is_registered field from mediawiki/page/change schema - https://phabricator.wikimedia.org/T337395 (10tchin) a:03tchin [13:48:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [13:49:37] (03CR) 10Jbond: [C: 03+1] Switch to systemd (033 comments) [analytics/udplog] - 10https://gerrit.wikimedia.org/r/673596 (https://phabricator.wikimedia.org/T276623) (owner: 10Majavah) [13:53:40] 10Data-Engineering, 10Data Pipelines (Sprint 13), 10Patch-For-Review: Update Sqoop for externallinks table changes - https://phabricator.wikimedia.org/T335917 (10Antoine_Quhen) Squooping test is conclusive and the patch could be merged right now. https://phabricator.wikimedia.org/P48501 [14:00:24] 10Data-Engineering, 10Event-Platform Value Stream: Use ECS logging fields when adding extra info to mediawiki-event-enrichment - https://phabricator.wikimedia.org/T337399 (10Ottomata) [14:01:32] 10Data-Engineering, 10Event-Platform Value Stream: Use ECS logging fields when adding extra info to mediawiki-event-enrichment - https://phabricator.wikimedia.org/T337399 (10Ottomata) @colewhite @fgiunchedi Is `labels` the proper ECS field in which to add extra logging context? [14:05:18] btullis: about the exclusion of an-test-worker1001. I don't mind using it for testing/canary purpose. 
Now, it also depends on how long it will take for you to have a fix, because in the meantime it will produce Airflow errors & SLA alerts. [14:13:49] aqu: Thanks. I'll see how I do by the end of the day. It looks like it's getting to be pretty involved building new hive packages, so if it is going to take long I will exclude it. [14:14:48] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A), 10Patch-For-Review: Improve mediawiki-event-enrichment test suite - https://phabricator.wikimedia.org/T328013 (10CodeReviewBot) tchin opened https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/merge_request... [14:14:59] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A), 10Patch-For-Review: Improve mediawiki-event-enrichment test suite - https://phabricator.wikimedia.org/T328013 (10CodeReviewBot) [14:29:21] 10Data-Engineering, 10Event-Platform Value Stream: Get coverage artifacts from Kokkuri - https://phabricator.wikimedia.org/T337400 (10tchin) [14:30:03] 10Data-Engineering, 10Event-Platform Value Stream: Get coverage artifacts from Kokkuri - https://phabricator.wikimedia.org/T337400 (10tchin) [14:30:06] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A), 10Patch-For-Review: Improve mediawiki-event-enrichment test suite - https://phabricator.wikimedia.org/T328013 (10tchin) [14:38:20] 10Data-Engineering, 10Event-Platform Value Stream: Use ECS logging fields when adding extra info to mediawiki-event-enrichment - https://phabricator.wikimedia.org/T337399 (10colewhite) >>! In T337399#8877020, @Ottomata wrote: > @colewhite @fgiunchedi Is `labels` the proper ECS field in which to add extra loggi... 
[14:52:24] I am testing kafka mirror maker on the test cluster [14:52:29] aaaand so far it doesn't work :D [14:52:30] org.apache.kafka.common.errors.TopicAuthorizationException: Not authorized to access topics: [eqiad.mediawiki.revision-create] [14:54:40] :( [14:55:45] 10Data-Engineering-Planning, 10Data Pipelines, 10Event-Platform Value Stream: Event partitions missing since 2023-02-21T10:00 for stream without events (canary events not produced?) - https://phabricator.wikimedia.org/T330236 (10Antoine_Quhen) There is a second problem hidden behind the missing Scala lib: t... [14:56:48] it is weird since I see [14:56:49] Certificate[1]: [14:56:49] Owner: CN=kafka_mirror_maker [14:59:04] (03CR) 10Milimetric: [V: 03+2 C: 03+2] "This is a really beautiful change, thank you Antoine. I especially love the update script on phabricator and the link from the commit mes" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/922172 (https://phabricator.wikimedia.org/T335917) (owner: 10Aqu) [15:09:14] 10Data-Engineering: Druid Webrequest sampled 128 has missing data data for 1 hour - https://phabricator.wikimedia.org/T337088 (10elukey) [15:11:28] 10Data-Engineering-Planning, 10SRE-swift-storage, 10Event-Platform Value Stream (Sprint 14 A): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10gmodena) This [[ https://logstash.wikimedia.org/app/dashboards#/view/f3fefa... [15:18:38] !log analytics-refinery, about to deploy [15:18:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:24:14] elukey: What's the current status? Is there something I can look at to try to help you troubleshoot, or have you reverted it? 
[15:27:49] btullis: I haven't yet, trying to add debug logging on kafka-test1006 but I can't find a real culprit [15:29:39] 10Data-Engineering-Planning, 10SRE-swift-storage, 10Event-Platform Value Stream (Sprint 14 A): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10gmodena) `mw_page_content_change_enrich__dse-k8s-eqiad` is not a valid s3 b... [15:30:22] 10Data-Engineering: stat1008's /srv partition is getting full due to home dirs - https://phabricator.wikimedia.org/T337246 (10achou) removed ~155G for `aikochou` [15:31:51] btullis: so I see something like this now [15:31:51] [2023-05-24 15:29:08,592] 1009 [mirrormaker-thread-0] DEBUG org.apache.kafka.common.network.SslTransportLayer - SSL handshake completed successfully with peerHost 'kafka-jumbo1005.eqiad.wmnet' peerPort 9093 peerPrincipal 'CN=kafka-jumbo1005.eqiad.wmnet' cipherSuite 'TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384' [15:33:50] ah wait lol [15:34:03] OK, so that's the SSL handshake, but it doesn't show how the local peer (kafka-test1006) identifies itself. 
[15:34:05] I see the same error on 1007, on which I haven't run puppet yet [15:34:13] * elukey cries in a corner [15:34:34] [2023-05-24 15:34:19,959] 5474 [kafka-producer-network-thread | kafka-mirror-kafka-test1007-jumbo-eqiad_to_test-eqiad@0] WARN org.apache.kafka.clients.NetworkClient - [Producer clientId=kafka-mirror-kafka-test1007-jumbo-eqiad_to_test-eqiad@0] Error while fetching metadata with correlation id 3 : {eqiad.mediawiki.revision-create=TOPIC_AUTHORIZATION_FAILED} [15:34:44] so it is broken in there probably [15:36:31] 10Data-Engineering, 10Equity-Landscape: Population input metrics - https://phabricator.wikimedia.org/T309279 (10JAnstee_WMF) @ntsako signing off on this [15:36:37] the real issue seems to be [15:36:37] [2023-05-24 15:36:17,391] 3904 [mirrormaker-thread-0] ERROR org.apache.kafka.clients.producer.internals.ErrorLoggingCallback - Error when sending message to topic eventlogging_SearchSatisfaction with key: null, value: 1035 bytes with error: [15:39:05] and it is on the producer side, so mirror -> test brokers [15:39:10] I am wondering if we are missing acls in there [15:39:46] stevemunene I'm in the process of deploying analytics/refinery and I'm stuck at our current bug with git when running `sudo -u hdfs kerberos-run-command hdfs /srv/deployment/analytics/refinery/bin/refinery-deploy-to-hdfs --verbose --no-dry-run` from `an-launcher1002:/srv/deployment/analytics/refinery` (and later from an-test-coord1001.eqiad.wmnet) [15:39:46] Could you `chmod +w /srv/deployment/analytics/refinery/.git/objects` temporarily? 
[15:40:27] 10Data-Engineering, 10Equity-Landscape: Access input metrics - https://phabricator.wikimedia.org/T324968 (10JAnstee_WMF) @ntsako signing off on this [15:42:57] 10Data-Engineering, 10Equity-Landscape: 2022 - Social Progress index - https://phabricator.wikimedia.org/T330897 (10JAnstee_WMF) @ntsako signing off on this [15:43:54] btullis: added https://phabricator.wikimedia.org/T250250#6138068 [15:43:55] 10Data-Engineering, 10Equity-Landscape: 2022 - Affiliates input metrics - https://phabricator.wikimedia.org/T330295 (10JAnstee_WMF) @ntsako signing off on this [15:44:58] btullis: aaand it works [15:45:22] ottomata: any chance you can approve this https://gerrit.wikimedia.org/r/c/operations/puppet/+/922799 access to analytics-privatedata-users, it's a contractor who has been re-hired [15:48:44] !log run `kafka acls --add --allow-principal User:CN=kafka_mirror_maker --producer --topic '*'` on kafka test - T337248 [15:48:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:48:49] T337248: kafka_mirror_maker TLS cert about to expire - 2023 - https://phabricator.wikimedia.org/T337248 [15:49:38] btullis, ottomata - I think that kafka test works, I'd proceed with jumbo and main if you can double-check the metrics as well [15:50:12] elukey: Nice. You only had to add the ACLs to the kafka-test cluster. Is that right? [15:50:30] exactly yes [15:50:36] the other clusters should be configured [15:51:10] Yep, got it. I'm here for the next 40 minutes, but I've got much more time tomorrow if you'd rather do it then. [15:51:21] I don't see metrics for test in https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker but the logs look good [15:51:26] aqu: I have followed the steps outlined here https://phabricator.wikimedia.org/T334493#8855435 it should work right now. documenting and bringing this to the attention of btullis [15:51:35] btullis: I can run puppet now on jumbo1001 [15:52:01] elukey: ack - looking at the metrics. 
[15:53:29] Thx stevemunene, it's working on an-launcher1002. Could you also fix an-test-coord1001? [15:54:01] yes doing so rn aqu [15:54:44] btullis: 1001 done, looks good so far [15:55:09] aqu: stevemunene: This is not my favourite kind of 'working' because it's still a workaround. I thought that this fix was supposed to fix it for good, but maybe it doesn't apply correctly for subdirectories: https://gerrit.wikimedia.org/g/operations/puppet/+/production/modules/profile/manifests/analytics/refinery_git_config.pp [15:55:14] I'll move jumbo's mirror makers to pki [15:55:42] thought so as well btullis we need to add the revs sub directory [15:56:31] !log move kafka mirror on kafka jumbo brokers to PKI - T337248 [15:56:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:56:34] T337248: kafka_mirror_maker TLS cert about to expire - 2023 - https://phabricator.wikimedia.org/T337248 [15:57:45] elukey: Looks good to me so far. [16:01:50] rolling it out slowly, 5/9 hosts done [16:01:55] metrics look good afaics [16:01:59] will do main afterwards [16:05:40] !log move kafka mirror on kafka main brokers to PKI - T337248 [16:05:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:05:45] T337248: kafka_mirror_maker TLS cert about to expire - 2023 - https://phabricator.wikimedia.org/T337248 [16:07:02] (03PS1) 10Snwachukwu: Resolve Guava toImmutableList method error [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/922894 [16:10:44] btullis, ottomata - all done! [16:11:06] elukey: Excellent! Thanks so much. [16:11:23] will follow up with the clean ups :) [16:12:40] 10Data-Engineering, 10SRE, 10Shared-Data-Infrastructure, 10serviceops: kafka_mirror_maker TLS cert about to expire - 2023 - https://phabricator.wikimedia.org/T337248 (10elukey) Rolled out the new keystores to all clusters! Next steps: * Clean up kafka mirror's classes as suggested in https://gerrit.wikime... 
[16:12:56] btullis: we can do a similar thing with varnishkafka [16:13:00] should work fine as well [16:13:30] ottomata: when you see this please check if mirror maker works as expected, I think so but a confirmation from you would be nice :) [16:19:13] !log Deployed refinery using scap, then deployed onto hdfs [16:19:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:24:25] 10Data-Engineering: Upgrade Presto to access UDF library improvements - https://phabricator.wikimedia.org/T295589 (10JArguello-WMF) [16:33:07] 10Data-Engineering, 10serviceops, 10Event-Platform Value Stream (Sprint 14 A), 10Patch-For-Review, 10Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (10Ottomata) FYI: I just added swift access key to wikikube main mw-pa... [16:34:05] 10Data-Engineering, 10serviceops, 10Event-Platform Value Stream (Sprint 14 A), 10Patch-For-Review, 10Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (10Ottomata) [16:49:53] elukey: i just checked codfw.mediawiki.revision-create on kafka jumbo and see some new messages there (I guess somehow wikidata can create revisions in codfw? 
) that means that main-codfw -> main-eqiad -> jumbo-eqiad is working :) [16:56:06] 10Data-Engineering-Planning, 10GitLab (CI & Job Runners), 10Performance Issue, 10Release-Engineering-Team (Radar): Improve speed of Gitlab CI - https://phabricator.wikimedia.org/T311111 (10thcipriani) [16:56:23] 10Data-Engineering-Planning, 10GitLab (CI & Job Runners), 10Performance Issue: Improve speed of Gitlab CI - https://phabricator.wikimedia.org/T311111 (10thcipriani) [17:30:27] 10Data-Engineering, 10Data Pipelines, 10Event-Platform Value Stream: Fix wikimedia-event-utilities Guava dependencies issues - https://phabricator.wikimedia.org/T337421 (10Ottomata) a:05Snwachukwu→03None [17:32:51] 10Data-Engineering-Planning, 10Data Pipelines, 10Event-Platform Value Stream: Event partitions missing since 2023-02-21T10:00 for stream without events (canary events not produced?) - https://phabricator.wikimedia.org/T330236 (10Ottomata) Filed: {T337421} [18:34:56] 10Data-Engineering-Planning, 10DC-Ops, 10SRE, 10Shared-Data-Infrastructure, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jclark-ctr) [18:37:46] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A): jsonschema-tools tests - ensure that array items type is set - https://phabricator.wikimedia.org/T329515 (10Ottomata) PR here: https://github.com/wikimedia/jsonschema-tools/pull/41 @tchin could you review ^ ? 
[18:38:53] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A): jsonschema-tools tests - ensure that array items type is set - https://phabricator.wikimedia.org/T329515 (10Ottomata) [18:39:00] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A): jsonschema-tools tests - ensure that array items type is set - https://phabricator.wikimedia.org/T329515 (10Ottomata) a:03Ottomata [18:54:46] 10Data-Engineering, 10Anti-Harassment, 10Data-Persistence, 10IP Masking, and 2 others: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (10Ladsgroup) TBH, any new concept is going to be confusing for a while. [19:25:18] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A): Document Flink job deployment to k8s - https://phabricator.wikimedia.org/T329629 (10Ottomata) I think we don't want to move most of that stuff, the value stream and POC stuff can stay there. We mostly want organized user manual docs abo... [19:56:02] (03PS2) 10Xcollazo: Add iceberg version of referrer_daily table. [analytics/refinery] - 10https://gerrit.wikimedia.org/r/917404 (https://phabricator.wikimedia.org/T335305) [19:58:45] (03CR) 10Xcollazo: Add iceberg version of referrer_daily table. (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/917404 (https://phabricator.wikimedia.org/T335305) (owner: 10Xcollazo) [20:11:13] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A): Improve Event Platform and MediaWiki Event Enrichment wikitech documentation - https://phabricator.wikimedia.org/T329629 (10Ottomata) [20:11:28] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A): Improve Event Platform and MediaWiki Event Enrichment wikitech documentation - https://phabricator.wikimedia.org/T329629 (10Ottomata) I updated the [[ https://wikitech.wikimedia.org/wiki/Event_Platform/Event_Utilities | Event Utilities d... 
[20:33:31] 10Data-Engineering, 10serviceops, 10Event-Platform Value Stream (Sprint 14 A), 10Patch-For-Review, 10Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (10Ottomata) [21:26:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [21:36:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage