[00:26:42] (CR) Gergő Tisza: [C: +2] helppanel: Document savedTaskType in action_data for trynewtask-impression [schemas/event/secondary] - https://gerrit.wikimedia.org/r/899637 (https://phabricator.wikimedia.org/T330637) (owner: Kosta Harlan)
[00:27:19] (Merged) jenkins-bot: helppanel: Document savedTaskType in action_data for trynewtask-impression [schemas/event/secondary] - https://gerrit.wikimedia.org/r/899637 (https://phabricator.wikimedia.org/T330637) (owner: Kosta Harlan)
[00:49:20] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:00:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:30:36] matthiasmullie: Good morning
[08:31:33] matthiasmullie: You're running a spark job that takes 20% of cluster resources - Can you please make it smaller?
[08:33:09] (Restored) Hashar: [Full dump analysis] Reduce edits_only and reverts_only intricacy [analytics/wikistats] - https://gerrit.wikimedia.org/r/118436 (owner: Nemo bis)
[08:33:15] (Restored) Hashar: Archives are downloaded in .txt.gz format: fix matching and opening [analytics/wikistats] - https://gerrit.wikimedia.org/r/92066 (owner: Nemo bis)
[08:34:10] (Restored) Hashar: Remove all trailing whitespace [analytics/wikistats] - https://gerrit.wikimedia.org/r/145862 (owner: Nemo bis)
[08:34:14] (CR) Hashar: "I am not sure this one is worth merging in given the repo is to be archived (T332004).
Then if you rebase the patches I am happy to get it" [analytics/wikistats] - https://gerrit.wikimedia.org/r/145862 (owner: Nemo bis)
[08:34:37] (Restored) Hashar: Comment some path tests which overrode standard ones [analytics/wikistats] - https://gerrit.wikimedia.org/r/118261 (owner: Nemo bis)
[08:35:09] (CR) Hashar: Comment some path tests which overrode standard ones (1 comment) [analytics/wikistats] - https://gerrit.wikimedia.org/r/118261 (owner: Nemo bis)
[08:35:52] joal: aborted it
[08:36:04] thanks a lot matthiasmullie :)
[08:36:25] Analytics-Radar, Data-Engineering, Data-Engineering-Wikistats, Diffusion-Repository-Administrators, and 4 others: Archive analytics/wikistats - https://phabricator.wikimedia.org/T332004 (hashar) >>! In T332004#8699368, @Milimetric wrote: >>>! In T332004#8696727, @hashar wrote: > It is wikistats 2...
[08:46:00] Data-Engineering, Event-Platform Value Stream, Metrics-Platform-Planning, Product-Analytics, WMF-Architecture-Team: Major (API) versioning of Event Platform streams - https://phabricator.wikimedia.org/T332212 (Ottomata)
[08:46:15] Data-Engineering, Event-Platform Value Stream, Metrics-Platform-Planning, Product-Analytics, WMF-Architecture-Team: Major (API) versioning of Event Platform streams - https://phabricator.wikimedia.org/T332212 (Ottomata) > complete the second con Oops, done ty!
[08:49:04] Data-Engineering-Planning, SRE-swift-storage, Event-Platform Value Stream (Sprint 10): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (dcausse) Hi everyone and sorry to jump into this conversion but just wanted t...
[09:26:27] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (Marostegui)
[09:43:28] Data-Engineering-Planning, DBA, Data Pipelines, Infrastructure-Foundations, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (fgiunchedi)
[09:58:34] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Deploy ceph mon processes to data-engineering cluster - https://phabricator.wikimedia.org/T330149 (BTullis) I have created the initial monmap for bootstrapping the cluster with the following command. ` btullis@cephosd100...
[10:03:55] Data-Engineering-Planning, DBA, Data Pipelines, Infrastructure-Foundations, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (fgiunchedi)
[10:15:04] Data-Engineering-Planning, SRE-swift-storage, Event-Platform Value Stream (Sprint 10): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (gmodena) >>! In T330693#8698639, @MatthewVernon wrote: > This is a k8s applic...
[10:24:46] Data-Engineering-Planning, serviceops, Event-Platform Value Stream (Sprint 10), Patch-For-Review, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (Joe) Hi, I have a few questions, if the plan is we move this...
[10:24:58] Data-Engineering-Planning, SRE-swift-storage, Event-Platform Value Stream (Sprint 10): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (gmodena) Hi Eric, > I know; I didn't mean for this to come across as an indi...
[13:13:39] (PS1) Btullis: Experimental refactor of the datahub container build process [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/900310 (https://phabricator.wikimedia.org/T301453)
[13:16:28] (CR) CI reject: [V: -1] Experimental refactor of the datahub container build process [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/900310 (https://phabricator.wikimedia.org/T301453) (owner: Btullis)
[13:23:28] Data-Engineering-Planning, DBA, Data Pipelines, Infrastructure-Foundations, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (fgiunchedi)
[13:29:10] (PS2) Btullis: Experimental refactor of the datahub container build process [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/900310 (https://phabricator.wikimedia.org/T301453)
[13:30:56] (CR) CI reject: [V: -1] Experimental refactor of the datahub container build process [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/900310 (https://phabricator.wikimedia.org/T301453) (owner: Btullis)
[13:37:57] Folks, to continue the analysis of the memory leak on hiveserver2, I'd like to run some specific Hive queries. I need one that calls one of these UDFs: org.wikimedia.analytics.refinery.core.UAParse, org.wikimedia.analytics.refinery.core.referer.RefererClassifier, org.wikimedia.analytics.refinery.core.Webrequest or org.wikimedia.analytics.refinery.core.PageviewDefinition (if possible a not-too-big one, so it can be launched multiple times).
[13:37:57] Could one of you help me construct such a query to run on hiveserver2?
[13:39:55] I'm there nfraison_
[13:40:35] nfraison_: batcave?
https://meet.google.com/rxb-bjxn-nip
[13:56:10] Analytics-Radar, Data-Engineering-Planning, Metrics-Platform-Planning, CSS: Schema code samples popup appears under the JSON table - https://phabricator.wikimedia.org/T272857 (phuedx) @He7d3r, @Ammarpad: Please could you add information about which browser you're using when you see this. I see t...
[14:01:02] Data-Engineering-Planning, SRE-swift-storage, Event-Platform Value Stream (Sprint 10): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (Eevans) >>! In T330693#8701662, @gmodena wrote: >>> [ ... ] >>> >>> How woul...
[14:05:41] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 10), Patch-For-Review: Flink EventStreamCatalog should not prevent creation of VIEWs - https://phabricator.wikimedia.org/T330703 (tchin)
[14:25:15] (PS6) Jennifer Ebe: T330206 - Create Mediacounts Load Hourly HQL [analytics/refinery] - https://gerrit.wikimedia.org/r/896334
[14:27:24] (CR) Jennifer Ebe: "Done" [analytics/refinery] - https://gerrit.wikimedia.org/r/896334 (owner: Jennifer Ebe)
[14:27:44] (CR) Joal: [C: +1] "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/896334 (owner: Jennifer Ebe)
[14:41:17] btullis: Heya - how about that deploy?
[14:41:36] btullis: if you're onto something else I'll do it next week
[14:42:05] Oh yes, can you give me 5 minutes? Just chatting in #wikimedia-k8s-sig for a minute about the datahub build process.
[14:50:53] Analytics-Radar, Data-Engineering-Icebox, Metrics-Platform-Planning, Product-Analytics: Draft of full process for instrumentation using new client libraries - https://phabricator.wikimedia.org/T275694 (phuedx)
[14:52:21] Data-Engineering-Planning, Cassandra, Image-Suggestions, Section-Level-Image-Suggestions: Section Level Image Suggestions - Data Persistence Request - https://phabricator.wikimedia.org/T320831 (Eevans)
[14:57:33] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:05:07] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:07:21] joal: Argh, sorry. I forgot, I have a school run now. Then platform_eng airflow upgrade at 16:00 UTC. Can do the deploy after standup, if that helps.
[15:08:20] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Investigate trend of gradual hive server heap exhaustion - https://phabricator.wikimedia.org/T303168 (nfraison) Taking HeapDump of the test hiveserver2 and analyzing it with MAT show multiple instances of lots of our Sin...
[15:08:49] btullis: no worries, we'll do next week
[15:13:11] FYI, the root cause of the metaspace / old GC leak on hiveserver2 has been found.
[15:13:11] It is due to the way hiveserver manages queries, combined with our UDF singletons.
[15:13:11] To avoid class conflicts between jobs that rely on the same class from different releases of a jar, Hive creates a dedicated classloader for each job.
[15:13:11] A singleton relies on an instance of the class being available in the current classloader, which means that with a different classloader we create a new singleton...
[15:13:11] Then, as singletons are never reclaimed, we leak those instances plus all of the classes they require in that classloader, which leads to the leak in the Metaspace and in the Old generation.
[15:13:12] Here is some information about singletons and the issue with different classloaders, plus some potential solutions: https://www.infoworld.com/article/2073352/core-java-simply-singleton.html?page=3
[15:13:13] We should discuss the usage of those singletons: do we really need them as singletons? Could we apply the potential solutions from the doc? Any other ideas?
[15:14:35] Example of singleton UDFs leaking:
[15:14:35] Class Name | Count | Defined Classes | No. of Instances
[15:14:35] -----------------------------------------------------------------------------------------------------
[15:14:35] org.wikimedia.analytics.refinery.hive.IsPageviewUDF | 299 | |
[15:14:35] org.wikimedia.analytics.refinery.core.Webrequest | 299 | |
[15:14:36] org.wikimedia.analytics.refinery.core.PageviewDefinition| 299 | |
[15:14:36] -----------------------------------------------------------------------------------------------------
[15:16:47] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:26:17] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:31:46] nfraison_: if we can, we should get rid of all those singletons! Never a good idea. I think I left comments on a semi recent CR in that direction. And from what I recall, it shouldn't be super difficult to get rid of them.
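Editor's sketch of the leak described above. This is only an analogy, not the refinery code: Python has no classloaders, so a fresh module namespace stands in for the dedicated classloader Hive creates, and a list stands in for the JVM never reclaiming the singletons. The names `load_in_fresh_namespace`, `run_queries` and the Python `PageviewDefinition` class are hypothetical, chosen only to mirror the table above.

```python
import types

# Source of a class using the classic lazy singleton pattern.
# Hypothetical stand-in for a UDF helper class; not the real refinery code.
UDF_SOURCE = '''
class PageviewDefinition:
    _instance = None

    @classmethod
    def get_instance(cls):
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance
'''


def load_in_fresh_namespace(name):
    """Stand-in for a dedicated classloader: define the class in a new module."""
    module = types.ModuleType(name)
    exec(UDF_SOURCE, module.__dict__)
    return module


def run_queries(count):
    """Load the class once per simulated query and keep every singleton alive."""
    retained = []  # assumption: models singletons that are never reclaimed
    for i in range(count):
        module = load_in_fresh_namespace(f"query_{i}")
        retained.append(module.PageviewDefinition.get_instance())
    return retained


instances = run_queries(299)
distinct_classes = {type(obj) for obj in instances}
# One "singleton" and one class definition per loader, not one overall.
print(len(instances), len(distinct_classes))
```

Within a single namespace `get_instance()` keeps returning the same object, so the pattern looks correct locally; the duplication only shows up across loaders, which would be consistent with the count of 299 per class in the heap dump table.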
[15:32:03] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:35:26] Analytics, Data-Engineering, Event-Platform Value Stream, EventStreams, and 2 others: Expose rdf-streaming-updater.mutation content through EventStreams - https://phabricator.wikimedia.org/T294133 (Gehel)
[15:36:29] Analytics, Data-Engineering, Event-Platform Value Stream, EventStreams, and 2 others: Expose rdf-streaming-updater.mutation content through EventStreams - https://phabricator.wikimedia.org/T294133 (Gehel) Note that there is additional context in T330521.
[15:45:11] (CR) Tsevener: "Note about maybe adding wiki_id to ios_reading_lists too. I'm a bit rusty on that PR though, so let me know if I'm wrong in my thinking!" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/898851 (https://phabricator.wikimedia.org/T331481) (owner: Mazevedo)
[15:47:14] +1 to that gehel and nfraison_ - The practical side says that as we're moving to spark the problem will go away from the hiveServer, but I agree it'd be good to move away from singletons
[16:02:53] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:29:52] !log stopping puppet and airflow services on an-airflow1004 for the upgrade.
[16:29:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:31:49] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:32:46] (CR) Mazevedo: Add new unified mobile apps schema for Session (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/898851 (https://phabricator.wikimedia.org/T331481) (owner: Mazevedo)
[16:36:17] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:37:29] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:45:49] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:49:13] nfraison_: nice work on hive!
[16:51:41] !log upgrading airflow package on an-airflow1004
[16:51:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:56:53] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:56:57] -17
[16:57:00] err sorry
[16:58:31] thanks elukey.
[16:58:31] I've just seen that I missed your messages from yesterday on the spark/kerb.
[16:58:31] As you suggested, I will contact serviceops so we can discuss this vault point and see whether we could replace it with something already available at wikimedia, or whether we agree to use it.
[16:58:54] super thanks!
[17:00:04] !log enabling puppet on an-airflow1004 to restart airflow services.
[17:00:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:02:29] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:06:23] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:12:09] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:16:06] (PS1) Snwachukwu: Copy add_partition hql script from Oozie to Hql folder. [analytics/refinery] - https://gerrit.wikimedia.org/r/900389
[17:17:58] (PS2) Snwachukwu: Copy add_partition hql script from Oozie to Hql folder. [analytics/refinery] - https://gerrit.wikimedia.org/r/900389 (https://phabricator.wikimedia.org/T330200)
[17:19:51] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:27:39] Analytics-Radar, Data-Engineering-Planning, Metrics-Platform-Planning, CSS: Schema code samples popup appears under the JSON table - https://phabricator.wikimedia.org/T272857 (Ammarpad) @phuedx, my comment was added over 2 years ago, so it's not surprising the issue has been fixed in the meantime...
[17:27:39] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:33:27] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:45:03] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service,session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:45:34] I'm looking into this an-worker1132 failure now.
[17:45:45] (PS2) Mazevedo: Add new unified mobile apps schema for Session [schemas/event/secondary] - https://gerrit.wikimedia.org/r/898851 (https://phabricator.wikimedia.org/T331481)
[17:48:51] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service,session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:56:27] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service,session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:59:39] (CR) Tsevener: [C: +1] "Looks good to me! I'll leave this open in case @Sharvaniharan wants a chance to look at it." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/898851 (https://phabricator.wikimedia.org/T331481) (owner: Mazevedo)
[18:01:09] It looks like there's just a lot of memory pressure on this host.
There are a lot of processes by cmyrick running on the host, possibly related to this job: https://yarn.wikimedia.org/proxy/application_1678266962370_35215/
[18:05:24] So it's currently using about 3.15 GB of swap and I think it's thrashing all of the Hadoop disks https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=an-worker1132&var-datasource=thanos&var-cluster=analytics
[18:05:55] I'm not going to touch any processes on it for now, without knowing more about these processes.
[18:16:21] Data-Engineering, IP Masking, Product-Analytics: Clarify definitions around anonymous and temporary editors - https://phabricator.wikimedia.org/T332205 (Niharika) Hi! I want to clarify that the following are code functions and not database fields: * **User::isRegistered() **will return true for all r...
[18:17:45] Analytics-Radar, Data-Engineering-Planning, Metrics-Platform-Planning, CSS: Schema code samples popup appears under the JSON table - https://phabricator.wikimedia.org/T272857 (phuedx)
[18:22:46] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service,session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:26:04] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service,session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:29:38] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service,session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:11:45] Data-Engineering, Advanced-Search, All-and-every-Wikisource, ArticlePlaceholder, and 70 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (Jdlrobson)