[06:08:09] <wikibugs>	 Data-Engineering, Data-Catalog, Data Engineering and Event Platform Team (Sprint 1), Event-Platform: Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (odimitrijevic) @BTullis we'll need the SRE team's help with the deployment of the event platform schema ingestio...
[06:18:07] <wikibugs>	 Data-Engineering, cloud-services-team, Cloud-Services-Origin-User: WMCS-roots paging responsibilities - https://phabricator.wikimedia.org/T344608 (Marostegui) There are two possible reasons: 1) I caught the alert too fast before it even paged. 2) The host is marked as non critical (aka irc-alert only...
[07:21:42] <jinxer-wm>	 (SystemdUnitFailed) firing: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:26:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:43:41] <wikibugs>	 (PS1) Peter Fischer: Reuse existing schema fragments for redirects. [schemas/event/primary] - https://gerrit.wikimedia.org/r/951829 (https://phabricator.wikimedia.org/T325315)
[07:45:55] <wikibugs>	 (CR) Peter Fischer: "I hope it's okay to not bump the major version despite introducing breaking changes, since we have not deployed the application that would" [schemas/event/primary] - https://gerrit.wikimedia.org/r/951829 (https://phabricator.wikimedia.org/T325315) (owner: Peter Fischer)
[08:22:19] <btullis>	 !log beginning a rolling reboot of kafka-jumbo
[08:22:20] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:28:22] <wikibugs>	 Data-Platform-SRE: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (BTullis)
[08:40:31] <wikibugs>	 Data-Platform-SRE: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (Gehel)
[08:41:30] <wikibugs>	 Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Add Zookeeper config to 'rdf-streaming-updater' test service on DSE cluster - https://phabricator.wikimedia.org/T344614 (Gehel) a:bking
[08:43:28] <wikibugs>	 Data-Platform-SRE, Discovery-Search: Consider using git-lfs for elastic plugins repo - https://phabricator.wikimedia.org/T344462 (Gehel) p:Triage→Medium
[08:49:57] <wikibugs>	 Data-Platform-SRE, sre-alert-triage: Alert: Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - https://phabricator.wikimedia.org/T343318 (BTullis) Open→Resolved a:BTullis Thanks @gmodena
[08:50:42] <jinxer-wm>	 (SystemdUnitFailed) firing: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:55:27] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Data Products, Scap: analytics/refinery: Stop using git-fat - https://phabricator.wikimedia.org/T328472 (JAllemandou) >>! In T328472#9110768, @xcollazo wrote: > I would rephrase the problem as: Why do we need to keep generated artifacts inside our version cont...
[08:55:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:57:29] <wikibugs>	 Data-Platform-SRE, SRE, User-MoritzMuehlenhoff: Configure the Hadoop MapReduce ports to use a fixed range - https://phabricator.wikimedia.org/T111433 (BTullis)
[09:30:57] <gehel>	 btullis: I've made our backlog grooming meeting weekly until we catch up a bit
[09:31:37] <btullis>	 gehel: Ack, thanks.
[09:48:00] <wikibugs>	 Data-Platform-SRE, Product-Analytics: Allow connections to presto UI port - https://phabricator.wikimedia.org/T331455 (BTullis) I think that there would be a privacy concern if we were to do this. There's no authentication or authorization in the Presto UI (See https://github.com/prestodb/presto/issues/1...
[09:48:21] <wikibugs>	 Data-Platform-SRE, Product-Analytics: Allow connections to presto UI port - https://phabricator.wikimedia.org/T331455 (BTullis) p:Triage→Low
[09:53:32] <wikibugs>	 Data-Platform-SRE: Allow users to differentiate their JupyterHub logs in Logstash - https://phabricator.wikimedia.org/T293243 (BTullis) Open→Declined Having reviewed this, I'm not sure that there is much of a need. We can do any filtering that we need with normal logstash queries.
[09:57:49] <awight>	 I just noticed that the mailing list summary needs updates, it links to dead gmane URLs https://lists.wikimedia.org/postorius/lists/analytics.lists.wikimedia.org/
[10:00:04] <btullis>	 awight: Thanks for pointing that out. I think I can update it. We just need to replace postoris with hyperkitty, don't we?
[10:00:14] <btullis>	 postorius
[10:07:25] <wikibugs>	 Data-Engineering, Anti-Harassment, Growth-Team, MediaWiki-extensions-EventLogging, and 5 others: [EPIC] Deprecate EventLogging::logEvent() - https://phabricator.wikimedia.org/T318263 (phuedx)
[10:07:31] <wikibugs>	 Data-Platform-SRE, Product-Analytics: Allow connections to presto UI port - https://phabricator.wikimedia.org/T331455 (JAllemandou) Given the proposed solution from Andrew I don't think there'd be more privacy issues than with Hadoop, i.e: cluster users can see other folks' jobs. If the presto UI is hidd...
[11:09:42] <jinxer-wm>	 (SystemdUnitFailed) firing: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:10:26] <wikibugs>	 Data-Platform-SRE, Product-Analytics: Allow connections to presto UI port - https://phabricator.wikimedia.org/T331455 (BTullis) >>! In T331455#9112821, @JAllemandou wrote: > Given the proposed solution from Andrew I don't think there'd be more privacy issues than with Hadoop, i.e: cluster users can see o...
[11:11:35] <icinga-wm>	 PROBLEM - Presto Server on an-presto1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[11:11:47] <icinga-wm>	 PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:13:55] <wikibugs>	 Data-Platform-SRE: Investigate an-presto1002 failures - https://phabricator.wikimedia.org/T344808 (BTullis)
[11:14:44] <btullis>	 I have created https://phabricator.wikimedia.org/T344808 to investigate why an-presto1002 keeps alerting about presto-server failing.
[11:19:40] <wikibugs>	 Data-Platform-SRE: Investigate an-presto1002 failures - https://phabricator.wikimedia.org/T344808 (BTullis) p:Triage→High a:BTullis
[11:20:19] <icinga-wm>	 RECOVERY - Presto Server on an-presto1002 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[11:20:31] <icinga-wm>	 RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:21:16] <wikibugs>	 (CR) Gmodena: cirrussearch: add fetch_failure schema (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/854572 (https://phabricator.wikimedia.org/T317609) (owner: DCausse)
[11:21:41] <wikibugs>	 Data-Platform-SRE, Product-Analytics: Allow connections to presto UI port - https://phabricator.wikimedia.org/T331455 (JAllemandou) We have similar issues with yarn, although a bit different since we differentiate prod users, but for regular users, IIRC one can kill another job.
[11:24:37] <icinga-wm>	 PROBLEM - Presto Server on an-presto1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[11:24:49] <icinga-wm>	 PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:29:03] <icinga-wm>	 RECOVERY - Presto Server on an-presto1002 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[11:29:15] <icinga-wm>	 RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:33:12] <jinxer-wm>	 (SystemdUnitFailed) resolved: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:33:27] <wikibugs>	 Data-Platform-SRE: Investigate an-presto1002 failures - https://phabricator.wikimedia.org/T344808 (BTullis) I couldn't find any indication from `/var/log/syslog` of why the memory spikes occurred. Similarly, there was nothing useful in `dmesg -T`. The process table was dumped each time, showing that it was d...
[11:38:12] <jinxer-wm>	 (SystemdUnitFailed) firing: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:39:44] <awight>	 btullis: correction, https://lists.wikimedia.org/hyperkitty/list/analytics@lists.wikimedia.org/
[11:39:55] <wikibugs>	 Data-Platform-SRE: Investigate an-presto1002 failures - https://phabricator.wikimedia.org/T344808 (BTullis) It happened again after a fresh boot. ` btullis@an-presto1002:~$ uptime  11:39:12 up 11 min,  1 user,  load average: 4.15, 10.39, 5.26 btullis@an-presto1002:~$ systemctl list-units --state failed    UN...
[11:40:47] <icinga-wm>	 PROBLEM - Presto Server on an-presto1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[11:40:59] <icinga-wm>	 PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:01] <btullis>	 awight: Thanks. I removed all references to gmane and NNTP mode from the list description. Would you say that's ok now?
[11:43:37] <awight>	 btullis: Short and sweet!  As an outsider, I would love a sentence about what to expect on this list, e.g. "updates on our work, service maintenance notifications, and discussions about new features"...
[11:44:06] <awight>	 (just guessing what might be here since I really am an outsider ;-)
[11:46:49] * btullis awight: Thanks again for your input. I'll look again to improve it. 
[11:47:16] <wikibugs>	 Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (Stevemunene) Added the `AUTH_OIDC_CLIENT_AUTHENTICATION_METHOD` method and retested, the idp seems okay with everything, my user is authenticated and provided a...
[11:49:33] <icinga-wm>	 RECOVERY - Presto Server on an-presto1002 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[11:49:47] <icinga-wm>	 RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:58:12] <jinxer-wm>	 (SystemdUnitFailed) resolved: (2) druid-historical.service Failed on druid1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:05:31] <icinga-wm>	 PROBLEM - Presto Server on an-presto1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[12:05:43] <icinga-wm>	 PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:20:03] <icinga-wm>	 RECOVERY - Presto Server on an-presto1002 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[12:20:17] <icinga-wm>	 RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:21:21] <wikibugs>	 Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (BTullis) Thanks @Stevemunene - Good work. I can confirm that when I attempt to log into staging.
[12:23:56] <wikibugs>	 Data-Platform-SRE: Investigate an-presto1002 failures - https://phabricator.wikimedia.org/T344808 (BTullis) Tailing the log of the presto-server shows nothing apart from a gap between the process finishing its initialization and the next time it is started. ` btullis@an-presto1002:/var/log/presto$ tail -f ser...
[12:28:27] <jinxer-wm>	 (MediawikiPageContentChangeEnrichTaskManagerNotRunning) firing: ...
[12:28:27] <jinxer-wm>	 The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning
[12:33:27] <jinxer-wm>	 (MediawikiPageContentChangeEnrichTaskManagerNotRunning) resolved: ...
[12:33:27] <jinxer-wm>	 The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning
[12:36:27] <jinxer-wm>	 (MediawikiPageContentChangeEnrichTaskManagerNotRunning) firing: ...
[12:36:27] <jinxer-wm>	 The mw-page-content-change-enrich Flink cluster in eqiad has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning
[12:39:09] <btullis>	 gmodena: ^ Anything I can do to help here, or have you got this? I wasn't worried about the codfw one, but then it fired in eqiad too.
[12:40:42] <btullis>	 joal: I'd like to go for a reboot of an-launcher1002 today, if possible. Are you OK with that? I can time it between gobblin/refine job runs.
[12:41:27] <jinxer-wm>	 (MediawikiPageContentChangeEnrichTaskManagerNotRunning) resolved: (2) The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO  - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning
[12:55:43] <gmodena>	 btullis ack. I was deploying the service with joal. Everything is operational in all DCs. Let me check the alerts, because the downtime was shorter than the expected alerting threshold.
[13:06:30] <btullis>	 gmodena: Great, thanks. 
[13:47:03] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team (Sprint 1), Epic, Event-Platform: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 (gmodena)
[14:08:25] <wikibugs>	 Data-Platform-SRE, SRE, decommission-hardware, ops-eqiad: decommission db1108.eqiad.wmnet - https://phabricator.wikimedia.org/T336254 (Jclark-ctr) Open→Resolved
[14:16:59] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1117.eqiad.wmnet with OS bullseye executed with errors: - an-worker1117 (**FAIL*...
[14:35:10] <wikibugs>	 Data-Platform-SRE: Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0 - https://phabricator.wikimedia.org/T338057 (BTullis) p:Triage→High
[14:43:42] <wikibugs>	 Data-Platform-SRE: Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0 - https://phabricator.wikimedia.org/T338057 (BTullis) Another update: Spark 3.3.3 has just been released, as of August 21st. https://spark.apache.org/releases/spark-release-3-3-3.html  Should I s...
[14:44:14] <btullis>	 Heads-up, I'm going to be rebooting an-launcher1002 in 6 minutes' time.
[14:46:34] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team, Event-Platform: mw-page-content-change-enrich: filter out events larger than max.request.size - https://phabricator.wikimedia.org/T342399 (gmodena)
[14:47:31] <wikibugs>	 Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (Stevemunene) Found some info on what we might be missing. We have so far verified that authentication on the IDP side is okay and that we do receive a signed id t...
[14:50:12] <jinxer-wm>	 (SystemdUnitFailed) firing: (8) druid-broker.service Failed on an-druid1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:50:27] <btullis>	 !log rebooting an-launcher1002
[14:50:28] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:55:12] <jinxer-wm>	 (SystemdUnitFailed) resolved: (10) druid-broker.service Failed on an-druid1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:57:59] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1117.eqiad.wmnet with OS bullseye
[15:01:27] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) druid-historical.service Failed on an-druid1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:05:12] <jinxer-wm>	 (SystemdUnitFailed) resolved: (5) druid-historical.service Failed on an-druid1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:06:39] <icinga-wm>	 PROBLEM - Host an-druid1004 is DOWN: PING CRITICAL - Packet loss = 100%
[15:06:45] <icinga-wm>	 RECOVERY - Host an-druid1004 is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms
[15:07:01] <wikibugs>	 Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (Jclark-ctr)
[15:44:37] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1117.eqiad.wmnet with OS bullseye executed with errors: - an-worker1117 (**FAIL**)   - Downtimed on Ic...
[15:54:07] <btullis>	 I'm going to go for a failover of the Hadoop nameserver from an-master1001 to an-master1002, to facilitate rebooting an-master1001.
[15:59:53] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Data Products, Scap: analytics/refinery: Stop using git-fat - https://phabricator.wikimedia.org/T328472 (xcollazo) >  Currently this is done through the refine-repo deployment (on stats machine and on HDFS). Before removing the artifacts from the refine repo o...
[16:01:51] <wikibugs>	 Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (BTullis) In case it helps, I did a little digging into the CAS logs on idp-test1002 and stumbled upon this, which might help. ` root@idp-test1002:/var/log/cas# gr...
[16:04:26] <wikibugs>	 (CR) Clare Ming: "This change is ready for review." (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951560 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[16:37:35] <wikibugs>	 Data-Platform-SRE: Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0 - https://phabricator.wikimedia.org/T338057 (xcollazo) Re 3.3 vs 3.4, I am yet to do any tests on 3.4.  But actually, @BTullis, since I suspect that my current blocking issue (T340861#9101939) i...
[16:38:25] <wikibugs>	 Data-Platform-SRE: Investigate an-presto1002 failures - https://phabricator.wikimedia.org/T344808 (BTullis) Open→Resolved It hasn't happened for four hours, so I'm tempted to mark this as resolved, even though we haven't ascertained the root cause. If it happens again, we can reopen it.
[17:42:58] <wikibugs>	 Data-Platform-SRE, Discovery-Search: Create and publish new elastic dev image - https://phabricator.wikimedia.org/T344841 (bking)
[17:44:20] <wikibugs>	 (CR) Phuedx: Experiment with including fragments inside data objects. (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951560 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[17:48:00] <wikibugs>	 Data-Platform-SRE, Discovery-Search: Create and publish new elastic dev image - https://phabricator.wikimedia.org/T344841 (bking) I have a merge request ready to go, but I don't have permission to push. I clicked the "request permission" button in Gitlab, but in the meantime @dcausse's patch [[ https:/...
[17:57:47] <wikibugs>	 (Abandoned) Clare Ming: Experiment with including fragments inside data objects. [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951560 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[19:01:38] <wikibugs>	 (PS4) Clare Ming: Add Metrics Platform fragments by platform only [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557)
[19:02:22] <wikibugs>	 (CR) Clare Ming: Add Metrics Platform fragments by platform only (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[19:35:29] <wikibugs>	 Data-Engineering, Product-Analytics: Email notifications of new MediaWiki history snapshot availability - https://phabricator.wikimedia.org/T344854 (mpopov)
[20:37:38] <wikibugs>	 Data-Platform-SRE, Discovery-Search, Patch-For-Review: Create and publish new elastic dev image - https://phabricator.wikimedia.org/T344841 (CodeReviewBot) bking opened https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/48  elasticsearch: Update wmf-elasticsearch-search-plugins
[20:55:48] <icinga-wm>	 PROBLEM - Presto Server on an-presto1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[20:56:40] <icinga-wm>	 PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:19:48] <icinga-wm>	 RECOVERY - Presto Server on an-presto1002 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[21:20:58] <icinga-wm>	 RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:28:47] <wikibugs>	 (PS5) Clare Ming: Add Metrics Platform fragments by platform only [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557)
[22:48:27] <wikibugs>	 Data-Engineering, Dumps 2.0, Data Products (Sprint 00): Develop Dumps Triage Runbook - https://phabricator.wikimedia.org/T343325 (VirginiaPoundstone) p:Triage→High