[01:08:03] <wikibugs>	 (PS1) Neil Shah-Quinn (WMF): movement_metrics: Add Wikifunctions to queried database groups [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/959367 (https://phabricator.wikimedia.org/T346966)
[02:27:42] <jinxer-wm>	 (SystemdUnitFailed) firing: cleanup_tmpdumps.service Failed on dumpsdata1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:27:42] <jinxer-wm>	 (SystemdUnitFailed) firing: cleanup_tmpdumps.service Failed on dumpsdata1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:06:34] <brouberol>	 The kafka-jumbo rolling restart finished yesterday. Each broker is now fully working and the URP (under-replicated partitions) count is down to 0
[08:08:11] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 (brouberol)
[08:10:08] <brouberol>	 !log redeploying eventgate-analytics in staging T336041
[08:10:13] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:10:13] <stashbot>	 T336041: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041
[08:16:27] <elukey>	 brouberol: o/ one suggestion - when you deploy services like eventgate etc.. (that are on wikikube), drop a line in #wikimedia-serviceops
[08:16:45] <elukey>	 it is not necessary but stuff like eventgate-main may impact job queues etc..
[08:16:51] <elukey>	 so they know basically
[08:24:27] <brouberol>	 noted thanks!
[08:32:24] <wikibugs>	 Data-Engineering, Data-Engineering-Dashiki, Data Products: Windows 11 missing in analytics ? - https://phabricator.wikimedia.org/T346890 (TheDJ) >>! In T346890#9184336, @Mayakp.wiki wrote: > This feels like an effect of Chrome's UA reduction where in Phase 5, the device OS was replaced. See Rollout d...
[08:56:00] <joal>	 !log Rerun edit-hourly druid indexation to fix corrupted data file
[08:56:02] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:29:40] <jinxer-wm>	 (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[09:39:40] <jinxer-wm>	 (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[09:43:23] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 (brouberol)
[09:45:57] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 (brouberol) Open→Resolved
[10:09:23] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:12:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:16:23] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:17:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:26:13] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:27:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:32:33] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:32:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:55:21] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:57:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:00:20] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:02:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:58:30] <brouberol>	 moritzm: as part of https://phabricator.wikimedia.org/T346763, I find myself in need of packaging external open-source tooling as a Debian package, so we can use it in routine kafka operations. Would you have time in the coming days to share knowledge on Debian packaging with a neophyte? Thank you!
[12:05:03] * brouberol is afk for about 1h
[12:11:00] <wikibugs>	 (CR) Milimetric: [V: +2 C: +2] "I got a tentative +1 from Ben in slack, so that's good enough to try and deploy this.  I'm going to roll back if something breaks.  (Jenki" [analytics/aqs] - https://gerrit.wikimedia.org/r/958945 (https://phabricator.wikimedia.org/T342213) (owner: Milimetric)
[12:12:35] <moritzm>	 brouberol: yeah, sure thing. we can look into this next week
[12:16:43] <wikibugs>	 (PS1) Milimetric: Update aqs to 69ded27 [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/959727
[12:17:01] <wikibugs>	 (CR) Milimetric: [V: +2 C: +2] Update aqs to 69ded27 [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/959727 (owner: Milimetric)
[12:23:06] <wikibugs>	 (PS13) Btullis: Update to Superset version 2.0.1 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/957938 (https://phabricator.wikimedia.org/T335356)
[12:39:43] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team (Sprint 2), Event-Platform: Increase Max Message Size in Kafka Jumbo - https://phabricator.wikimedia.org/T344688 (gmodena) > I'll leave it to @gmodena to test that the settings have been satisfactorily applied before re...
[12:51:23] <brouberol>	 moritzm: thanks!
[12:53:36] <wikibugs>	 (CR) Milimetric: "This looks great, tried some different spark configs, and this one seems to be the winner:" [analytics/refinery] - https://gerrit.wikimedia.org/r/957899 (https://phabricator.wikimedia.org/T309738) (owner: Ladsgroup)
[12:58:52] <wikibugs>	 (PS4) Ladsgroup: Introduce MostTranscludedPages.hql [analytics/refinery] - https://gerrit.wikimedia.org/r/957899 (https://phabricator.wikimedia.org/T309738)
[13:12:24] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineer...
[13:42:13] <icinga-wm>	 PROBLEM - Check systemd state on archiva1002 is CRITICAL: CRITICAL - degraded: The following units failed: nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:42:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) nginx.service Failed on archiva1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:43:17] <icinga-wm>	 PROBLEM - HTTPS on archiva1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Archiva
[13:49:07] <icinga-wm>	 RECOVERY - Check systemd state on archiva1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:50:11] <icinga-wm>	 RECOVERY - HTTPS on archiva1002 is OK: SSL OK - Certificate archiva.wikimedia.org valid until 2023-11-29 22:21:23 +0000 (expires in 69 days) https://wikitech.wikimedia.org/wiki/Analytics/Systems/Archiva
[13:51:14] <wikibugs>	 Data-Engineering, Data Pipelines, Data Engineering and Event Platform Team (Sprint 2), Event-Platform, and 2 others: EventGate occasionally fails to ingest specific schemas - https://phabricator.wikimedia.org/T326002 (gmodena)
[13:52:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) nginx.service Failed on archiva1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:35:12] <wikibugs>	 Data-Engineering, Structured-Data-Backlog: NEW BUG REPORT fiwiki’s section-level image suggestions aren’t generated in production - https://phabricator.wikimedia.org/T343844 (mfossati) @xcollazo , I deleted those old `VariableProperties`.
[14:36:29] <wikibugs>	 Data-Engineering, Structured-Data-Backlog: NEW BUG REPORT fiwiki’s section-level image suggestions aren’t generated in production - https://phabricator.wikimedia.org/T343844 (mfossati) > The DAGs usually start on Thursdays: I'll verify their outputs before closing this bug. Well, seems like they've not s...
[14:40:11] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:42:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:45:45] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:47:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:01:36] <wikibugs>	 (CR) Mforns: [V: +2 C: +2] "Thank you Sam, for the awesome script and effort!" [analytics/refinery] - https://gerrit.wikimedia.org/r/755724 (owner: Awight)
[15:02:36] <milimetric>	 !log deployed aqs 1.0 to enable etags on all endpoints - so far everything looks ok
[15:02:37] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:12:08] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1118 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:12:20] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:12:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) hadoop-yarn-nodemanager.service Failed on an-worker1118:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:15:53] <wikibugs>	 Data-Engineering, Data-Engineering-Dashiki, Data Products: Windows 11 missing in analytics ? - https://phabricator.wikimedia.org/T346890 (Milimetric) Ok, so the action here would be to label the data better, and add an annotation for Phase 5 and any other big changes.
[15:16:25] <brouberol>	 Looking at the logs on an-worker1118, that was caused by a shortage of heap space
[15:16:25] <brouberol>	 2023-09-21 15:09:31,751 WARN org.sparkproject.io.netty.channel.AbstractChannelHandlerContext: An exception 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full stacktrace] was thrown by a user handler's exceptionCaught() method while handling the following exception:
[15:16:25] <brouberol>	 java.lang.OutOfMemoryError: Java heap space
[15:16:50] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1118 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:17:02] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:17:11] <brouberol>	 ^ I've restarted the process. I wonder why systemd didn't do it itself
[15:17:55] <brouberol>	 ah well, `Restart=no` in the systemd service config
[15:17:56] <btullis>	 I don't think that systemd is set to restart it, but puppet would probably have done so within 30 minutes. 
[15:18:06] <wikibugs>	 (CR) Mforns: [V: +2] Remove queries for deprecated mobile_apps jobs [analytics/refinery] - https://gerrit.wikimedia.org/r/931959 (https://phabricator.wikimedia.org/T329310) (owner: Mforns)
[15:19:09] <btullis>	 If you're on the host, you could look at why it might have failed. I suspect oom-killer but have also seen segfault in the past.
[15:19:59] <btullis>	 https://usercontent.irccloud-cdn.com/file/p6rblSOj/image.png
[15:20:02] <btullis>	 https://grafana-rw.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=an-worker1118&var-datasource=thanos&var-cluster=analytics
[15:20:32] <brouberol>	 looking at the logs, I found a java heap space error (cf my previous message)
[15:22:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) hadoop-yarn-nodemanager.service Failed on an-worker1118:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:25:50] <btullis>	 Just a guess at this stage, but I suspect this job, which is running with 58% of the Hadoop cluster resources. https://yarn.wikimedia.org/cluster/app/application_1694521537759_47834
[15:26:18] <wikibugs>	 Data-Engineering: NEW BUG REPORT Some DAG run attempts fail because File *_temporary/0 does not exist. - https://phabricator.wikimedia.org/T347076 (mpopov)
[15:26:45] <btullis>	 correction: 58% of queue, 20% of cluster resources.
[15:28:21] <wikibugs>	 (CR) Mforns: [C: +1] "Is Cassandra still loading both clusters?" [analytics/refinery] - https://gerrit.wikimedia.org/r/681682 (https://phabricator.wikimedia.org/T280649) (owner: Joal)
[15:30:06] <wikibugs>	 Data-Engineering: NEW BUG REPORT Some DAG run attempts fail because File *_temporary/0 does not exist. - https://phabricator.wikimedia.org/T347076 (mpopov) @Milimetric and I have a hypothesis that what's happening here is a race condition where the multiple concurrent runs of a DAG are all using the same tem...
[16:00:52] <brouberol>	 If anyone is up for a bit of python/kafka review, https://gerrit.wikimedia.org/r/c/operations/puppet/+/959162 and https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/959720 work together to improve both the runtime and reliability of kafka rolling restarts, by making sure the broker we just restarted is back in full sync before proceeding to
[16:00:52] <brouberol>	 the next one. Thank you!
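A minimal sketch of the idea described in the message above (restart one broker, then wait until the cluster is back in full sync before moving on to the next), not the code in the linked patches. All function and parameter names are hypothetical. "Back in full sync" is assumed here to mean an under-replicated partition count of 0, and that count is assumed to come from a caller-supplied function rather than from any real Kafka or cookbook API:

```python
import time
from typing import Callable, Iterable


def rolling_restart(
    brokers: Iterable[str],
    restart_broker: Callable[[str], None],      # hypothetical: restarts one broker
    under_replicated_count: Callable[[], int],  # hypothetical: current URP count
    poll_interval_s: float = 10.0,
    timeout_s: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> list[str]:
    """Restart brokers one at a time, waiting for URP == 0 after each restart."""
    restarted = []
    for broker in brokers:
        restart_broker(broker)
        deadline = clock() + timeout_s
        # Poll until no partitions are under-replicated, i.e. the restarted
        # broker has caught up, and only then move on to the next broker.
        while under_replicated_count() > 0:
            if clock() >= deadline:
                raise TimeoutError(
                    f"{broker} was not back in sync after {timeout_s}s"
                )
            sleep(poll_interval_s)
        restarted.append(broker)
    return restarted
```

Waiting on the sync signal instead of a fixed sleep is what would plausibly help both goals mentioned above: the loop proceeds as soon as the broker has caught up, and it never proceeds while partitions are still under-replicated.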
[16:23:15] <wikibugs>	 Data-Engineering, Data Products, Wikidata, Wikidata-Query-Service: Publish WDQS JNL files to dumps.wikimedia.org - https://phabricator.wikimedia.org/T344905 (VirginiaPoundstone)
[16:38:31] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (BTullis) I've tried two more builds, but I'm still finding the same issue. I t...
[16:44:42] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (BTullis) Ahah! The different version comes from https://repo.anaconda.com/pkgs...
[16:44:44] <wikibugs>	 (CR) Joal: Cleanup cassandra double loading (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/681682 (https://phabricator.wikimedia.org/T280649) (owner: Joal)
[17:12:36] <wikibugs>	 Data-Platform-SRE, observability, Epic: Review alerting around Search update pipeline - https://phabricator.wikimedia.org/T346807 (bking) Thanks Andrea and Leo!  I'm closing this one in favor of T346438 , but will subscribe y'all on that ticket.
[17:12:51] <wikibugs>	 Data-Platform-SRE, observability, Epic: Review alerting around Search update pipeline - https://phabricator.wikimedia.org/T346807 (bking) Open→Declined
[17:12:54] <wikibugs>	 Data-Platform-SRE, observability, Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438 (bking)
[17:44:41] <wikibugs>	 (CR) Mforns: [C: +2] Remove unused cassandra module [analytics/refinery/source] - https://gerrit.wikimedia.org/r/940154 (owner: Joal)
[17:59:38] <xcollazo>	 !log Deploy latest DAGs to analytics Airflow instance
[17:59:40] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:13:57] <wikibugs>	 (CR) Hghani: "+1 from me" [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/959367 (https://phabricator.wikimedia.org/T346966) (owner: Neil Shah-Quinn (WMF))
[18:19:31] <wikibugs>	 Data-Platform-SRE, Wikidata, Wikidata-Query-Service, serviceops-radar, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (bking)
[18:54:37] <wikibugs>	 Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (VRiley-WMF) an-master1003 - C 6. U 12. port 09 CableID 3193 an-master1004 - D 8. U 36. port 35 CableID 2013339101850
[19:22:42] <jinxer-wm>	 (SystemdUnitFailed) firing: cleanup_tmpdumps.service Failed on dumpsdata1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:22:58] <wikibugs>	 (CR) Hghani: [V: +2 C: +2] "Looks good to me" [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/959367 (https://phabricator.wikimedia.org/T346966) (owner: Neil Shah-Quinn (WMF))
[19:33:14] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (xcollazo) Interesting!  One observation:  We lock down the `conda-environment....
[19:54:52] <wikibugs>	 Data-Platform-SRE: Write new partman recipe for cloudelastic (jbod) and update relevant Elastic config - https://phabricator.wikimedia.org/T342463 (bking)
[19:55:29] <wikibugs>	 Data-Platform-SRE, Elasticsearch, Discovery-Search (Current work): Change partitioning scheme for elasticsearch from RAID to JBOD - https://phabricator.wikimedia.org/T231010 (bking) a:bking
[19:59:23] <wikibugs>	 Data-Platform-SRE, Elasticsearch, Discovery-Search (Current work), Patch-For-Review: Change partitioning scheme for elasticsearch from RAID to JBOD - https://phabricator.wikimedia.org/T231010 (bking)
[20:04:26] <wikibugs>	 Data-Platform-SRE, Wikidata, Wikidata-Query-Service, serviceops-radar, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (bking)
[21:00:58] <wikibugs>	 Data-Platform-SRE, Elasticsearch, Discovery-Search (Current work), Patch-For-Review: Change partitioning scheme for elasticsearch from RAID to JBOD - https://phabricator.wikimedia.org/T231010 (bking)
[21:14:55] <wikibugs>	 Data-Platform-SRE, Elasticsearch, Discovery-Search (Current work), Patch-For-Review: Change partitioning scheme for elasticsearch from RAID to JBOD - https://phabricator.wikimedia.org/T231010 (bking)
[22:33:20] <wikibugs>	 Data-Engineering, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Movement-Insights, WMDE-FUN-Sprint-2023-09-04: Unique Devices seasonal trends on small projects - https://phabricator.wikimedia.org/T344381 (Mayakp.wiki) Thanks for confirming @kai.nissen !  I checked our dashboards an...
[23:22:42] <jinxer-wm>	 (SystemdUnitFailed) firing: cleanup_tmpdumps.service Failed on dumpsdata1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:58:29] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team (Sprint 2): Document data pipeline and data set ownership - https://phabricator.wikimedia.org/T346295 (Ahoelzl) Next step, define temporary ownership of DE pipelines to meet DQ goals and develop platform.