[00:38:04] PROBLEM - Check unit status of monitor_refine_eventlogging_legacy on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:39:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:20:56] Data-Engineering, API Platform, Platform Engineering Roadmap, User-Eevans: AQS 2.0:Wikistats 2 service - https://phabricator.wikimedia.org/T288301 (BPirkle) It sounds like the first four lines of the above table are non-controversial. This allows us to move forward with moving the current [[ http...
[05:08:00] Data-Engineering, DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for igwikiquote - https://phabricator.wikimedia.org/T314639 (Marostegui) a:Marostegui @Ladsgroup this is not done from what I can see - we just got a report about private data still present. #c...
[05:11:11] Data-Engineering, DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for igwikiquote - https://phabricator.wikimedia.org/T314639 (Marostegui) This is not done from what I can see - we just got a report about private data still present. cloud-services-team please DO...
[05:11:20] Data-Engineering, DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for bclwikiquote - https://phabricator.wikimedia.org/T316456 (Marostegui) a:Marostegui This is not done from what I can see - we just got a report about private data still present. cloud-servic...
[05:11:42] Data-Engineering, DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for tlwikiquote - https://phabricator.wikimedia.org/T317111 (Marostegui) a:Marostegui @Ladsgroup this is not done from what I can see - we just got a report about private data still present. cl...
[06:08:32] Analytics, Data-Engineering: dbstore1005 "staging" instance is down - https://phabricator.wikimedia.org/T321464 (Marostegui) p:Triage→Medium
[06:24:04] Analytics, Data-Engineering: dbstore1005 "staging" instance is down - https://phabricator.wikimedia.org/T321464 (Marostegui) Ah, from SAL I can see it was a maintenance reboot so I assume it is fine to start it again.
[06:31:01] Analytics, Data-Engineering: dbstore1005 "staging" instance is down - https://phabricator.wikimedia.org/T321464 (Marostegui) Open→Resolved a:Marostegui I have started it again after talking to Amir
[06:45:03] Data-Engineering, DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for bclwikiquote - https://phabricator.wikimedia.org/T316456 (Marostegui) a:Marostegui→None This can now proceed. All fixed and the private data check was successful.
[06:45:45] Data-Engineering, DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for igwikiquote - https://phabricator.wikimedia.org/T314639 (Marostegui) a:Marostegui→None This can now proceed. All fixed and the private data check was successful.
[06:46:45] Data-Engineering, DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for tlwikiquote - https://phabricator.wikimedia.org/T317111 (Marostegui) a:Marostegui→None This can now proceed. All fixed and the private data check was successful.
[06:53:01] Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade db1108 to Bullseye - https://phabricator.wikimedia.org/T304492 (Marostegui)
[07:30:50] !log `elukey@stat1004:~$ sudo systemctl reset-failed jupyter-ntsako-singleuser.service`
[07:30:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:32:31] !log `elukey@stat1005:~$ sudo systemctl reset-failed session-c4122.scope session-c4123.scope session-c4124.scope session-c4447.scope session-c4450.scope session-c4449.scope session-c4638.scope jupyter-echetty-singleuser.service`
[07:32:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:50:08] Good morning elukey - thanks for the heads up on airflow - We also need to investigate and understand why sometimes it generates so many logs
[07:53:22] joal: bonjour :) The number of logs is not that high, it is around a few GBs of text, the main issue is the tiny root partition (~40G) with /srv, logs, etc. all on it
[07:59:33] right elukey - I wonder if there are not spikes of logs for not useful reasons, that we could tame
[08:43:37] elukey: joal: Thanks for this. I'll create a ticket to add more disk space, but as joal says it would be useful to understand why they're generated and how we might mitigate that too.
[08:45:09] Analytics, Data-Engineering: dbstore1005 "staging" instance is down - https://phabricator.wikimedia.org/T321464 (Ladsgroup) Thanks <3
[08:45:19] At first glance, it looks like the vast majority of the logs come from a single DAG
[08:45:22] Data-Engineering, DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for igwikiquote - https://phabricator.wikimedia.org/T314639 (Ladsgroup) Thanks!
[08:45:24] https://www.irccloud.com/pastebin/woKJuQyI/
[08:50:52] btullis: this dag is from the search team - let's ask dcausse and ebernhardson
[08:52:31] joal: Agreed. The dates are interesting too, because the largest files are in a folder named 2022-09-08 but the files contained were most recently written on Oct 22. This seems to overlap with more recent runs.
[08:52:58] o/
[08:53:02] will take a look
[08:54:59] dcausse: <3 Many thanks. Please do let us know if there's anything specific you'd like us to do.
[09:07:15] also dcausse, I killed 4 mjolnir old jobs earlier this weekend - It looks like the jobs can last a very long time - is that expected?
[09:07:58] joal: no and I think that explains why the logs are huge
[09:08:28] seems to be the spark heart beat
[09:10:30] weird :(
[09:33:03] (PS3) Matthias Mullie: Add schema for Extension:SearchVue actions [schemas/event/secondary] - https://gerrit.wikimedia.org/r/845498 (https://phabricator.wikimedia.org/T321069)
[09:33:31] (CR) CI reject: [V: -1] Add schema for Extension:SearchVue actions [schemas/event/secondary] - https://gerrit.wikimedia.org/r/845498 (https://phabricator.wikimedia.org/T321069) (owner: Matthias Mullie)
[09:51:03] (PS4) Matthias Mullie: Add schema for Extension:SearchVue actions [schemas/event/secondary] - https://gerrit.wikimedia.org/r/845498 (https://phabricator.wikimedia.org/T321069)
[11:13:42] (PS1) Btullis: Update the email used for alerting the data engineering team [analytics/refinery] - https://gerrit.wikimedia.org/r/848269 (https://phabricator.wikimedia.org/T315486)
[11:17:34] (CR) Btullis: "Bringing to the attention of all data-engineering engineers, for visibility." [analytics/refinery] - https://gerrit.wikimedia.org/r/848269 (https://phabricator.wikimedia.org/T315486) (owner: Btullis)
[12:27:17] (CR) Joal: [C: -1] "I think the only one valid here is the latest one for datahub. We're not gonna change the oozie ones are migrating away and those changes " [analytics/refinery] - https://gerrit.wikimedia.org/r/848269 (https://phabricator.wikimedia.org/T315486) (owner: Btullis)
[12:30:26] CUSTOM - HDFS topology check on an-master1001 is OK: OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_topology_check
[12:48:32] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 03): [SPIKE] Deploy event driven stateless Flink service to DSE cluster - https://phabricator.wikimedia.org/T320812 (gmodena) Related https://phabricator.wikimedia.org/T318535
[12:58:54] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 03): Easy Flink Python UDF + SQL enrichment - https://phabricator.wikimedia.org/T320968 (tchin)
[13:06:53] (CR) Matthias Mullie: Add schema for Extension:SearchVue actions [schemas/event/secondary] - https://gerrit.wikimedia.org/r/845498 (https://phabricator.wikimedia.org/T321069) (owner: Matthias Mullie)
[13:50:48] Data-Engineering-Planning, DC-Ops, SRE, Shared-Data-Infrastructure, ops-eqiad: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (Jclark-ctr) Verified mgmt cables they are connected and have link
[14:28:04] heya joal and aqu_ :] If you approve https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/176 I will go ahead and deploy
[14:29:20] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 03): Easy Flink Python UDF + SQL enrichment - https://phabricator.wikimedia.org/T320968 (Ottomata) Okay, think what we want is to abstract away Flink-y bit writing Flink UDFs in python. (and also probably in Java/Scala but that is not this t...
[14:39:24] Hi mforns - good for me :)
[14:44:35] Data-Engineering-Planning: Make mediawiki-history page and user sorting complete for denormalization - https://phabricator.wikimedia.org/T321493 (JAllemandou)
[14:46:11] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 03): [SPIKE] Deploy event driven stateless Flink service to DSE cluster - https://phabricator.wikimedia.org/T320812 (gmodena) Related https://phabricator.wikimedia.org/T321491
[14:49:31] joal: thanks!!!
[15:10:09] joal, will deploy refinery too, since it's all in the etherpad.
[15:12:13] !log starting refinery regular weekly deploy
[15:12:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:24:09] btullis, joal: :S there has just been an error at the refinery deploy, at refinery-deploy-to-hdfs stage...
[15:24:33] 22/10/24 15:23:04 WARN hdfs.DataStreamer: Caught exception
[15:24:33] java.lang.InterruptedException
[15:24:33] at java.lang.Object.wait(Native Method)
[15:24:33] at java.lang.Thread.join(Thread.java:1257)
[15:24:33] at java.lang.Thread.join(Thread.java:1331)
[15:24:33] at org.apache.hadoop.hdfs.DataStreamer.closeResponder(DataStreamer.java:980)
[15:24:33] at org.apache.hadoop.hdfs.DataStreamer.endBlock(DataStreamer.java:630)
[15:24:34] at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:807)
[15:24:46] And the script is hanging
[15:25:12] ok, now it finished
[15:25:37] the error was launched when attempting: hdfs dfs -D fs.permissions.umask-mode=022 -cp hdfs:///wmf/refinery/2022-10-24T15.21.22+00.00--scap_sync_2022-10-24_0001 hdfs:///wmf/refinery/current.tmp
[15:27:04] trying again
[15:28:45] Data-Engineering-Planning, SRE, SRE-swift-storage, Wikidata, and 3 others: Clean up the rdf-streaming-updater-codfw container from thanos-swift. - https://phabricator.wikimedia.org/T316031 (dcausse) @bking I see that the doc has been updated, can we move this ticket to the Needs reporting column?
[15:30:12] !log deployed airflow-dags as part of weekly train
[15:30:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:30:37] ok, btullis, joal, this time the refinery deployment worked
[15:30:53] !log finished deploying refinery as part of the weekly train
[15:30:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:38:43] Data-Engineering-Planning, Data Pipelines (Sprint 03), Technical-Debt: Create and deploy the fsimage job. - https://phabricator.wikimedia.org/T321168 (JArguello-WMF)
[15:48:26] Data-Engineering, Event-Platform Value Stream, MediaWiki-Core-Hooks: Add $comment and $performer to ArticleRevisionVisibilitySet params - https://phabricator.wikimedia.org/T321411 (Ottomata)
[16:19:58] !log `chown analytics-deploy /srv/deployment/analytics` on clouddumps100[1-2]
[16:19:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:22:19] thanks btullis for the quick patch on clouddumps machine for scap <3
[16:24:37] You're welcome. I changed the permissions manually, but will retrospectively set it with puppet.
[17:11:13] Data-Engineering, Data Pipelines: refinery scap deployment to thin nodes is broken - https://phabricator.wikimedia.org/T321506 (mforns)
[17:16:14] Data-Engineering, Event-Platform Value Stream (Sprint 03), Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (Ottomata) Ok, the annoying thing is that when deleting and checking the 'suppress data' box, the revisi...
[17:20:29] Hi btullis - we receive alerts from the test cluster that we previously weren't receiving I think - let's talk about that tomorrow if you wish
[17:25:12] joal: yes, let's.
[18:45:58] (CR) Mforns: [C: +1] "LGTM! Left 2 suggestions, but they are just an attempt at improving the readability of the boolean conditions. And not sure they are succe" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/842922 (https://phabricator.wikimedia.org/T318589) (owner: Joal)
[19:02:51] (CR) CI reject: [V: -1] Add new mediawiki state entity and change fragments, and use them in new mediawiki page change schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/807565 (https://phabricator.wikimedia.org/T308017) (owner: Ottomata)
[19:04:38] (PS19) Ottomata: Add new mediawiki state entity and change fragments, and use them in new mediawiki page change schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/807565 (https://phabricator.wikimedia.org/T308017)
[19:09:26] (CR) Ottomata: Add new mediawiki state entity and change fragments, and use them in new mediawiki page change schema (5 comments) [schemas/event/primary] - https://gerrit.wikimedia.org/r/807565 (https://phabricator.wikimedia.org/T308017) (owner: Ottomata)
[19:10:08] (PS20) Ottomata: Add new mediawiki state entity and change fragments, and use them in new mediawiki page change schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/807565 (https://phabricator.wikimedia.org/T308017)
[19:20:44] Data-Engineering, Event-Platform Value Stream (Sprint 03), Platform Team Initiatives (Modern Event Platform (TEC2)): Allow disabling/enabling configured streams via wgEventStreams config - https://phabricator.wikimedia.org/T259712 (Ottomata) Hm, I wonder, if instead of doing a top level 'enabled: fal...
[21:25:18] Analytics, Platform Engineering, Patch-For-Review: Add log entry details to page and user events in EventBus - https://phabricator.wikimedia.org/T263055 (Ottomata) @Milimetric should we abandon the attached gerrit changes?
[22:27:53] Data-Engineering, Event-Platform Value Stream (Sprint 03), Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (daniel) >>! In T308017#8339347, @Ottomata wrote: > It would be better if the Hook gave me a RevisionRec...