[06:08:09] Data-Engineering, Data-Catalog, Data Engineering and Event Platform Team (Sprint 1), Event-Platform: Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (odimitrijevic) @BTullis we'll need the SRE team's help with the deployment of the event platform schema ingestio...
[06:18:07] Data-Engineering, cloud-services-team, Cloud-Services-Origin-User: WMCS-roots paging responsibilities - https://phabricator.wikimedia.org/T344608 (Marostegui) There are two possible reasons: 1) I caught the alert too fast before it even paged. 2) The host is marked as non critical (aka irc-alert only...
[07:21:42] (SystemdUnitFailed) firing: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:26:42] (SystemdUnitFailed) resolved: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:43:41] (PS1) Peter Fischer: Reuse existing schema fragments for redirects. [schemas/event/primary] - https://gerrit.wikimedia.org/r/951829 (https://phabricator.wikimedia.org/T325315)
[07:45:55] (CR) Peter Fischer: "I hope it's okay to not bump the major version despite introducing breaking changes, since we have not deployed the application that would" [schemas/event/primary] - https://gerrit.wikimedia.org/r/951829 (https://phabricator.wikimedia.org/T325315) (owner: Peter Fischer)
[08:22:19] !log beginning a rolling reboot of kafka-jumbo
[08:22:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:28:22] Data-Platform-SRE: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (BTullis)
[08:40:31] Data-Platform-SRE: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (Gehel)
[08:41:30] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Add Zookeeper config to 'rdf-streaming-updater' test service on DSE cluster - https://phabricator.wikimedia.org/T344614 (Gehel) a:bking
[08:43:28] Data-Platform-SRE, Discovery-Search: Consider using git-lfs for elastic plugins repo - https://phabricator.wikimedia.org/T344462 (Gehel) p:Triage→Medium
[08:49:57] Data-Platform-SRE, sre-alert-triage: Alert: Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - https://phabricator.wikimedia.org/T343318 (BTullis) Open→Resolved a:BTullis Thanks @gmodena
[08:50:42] (SystemdUnitFailed) firing: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:55:27] Data-Engineering, Data-Platform-SRE, Data Products, Scap: analytics/refinery: Stop using git-fat - https://phabricator.wikimedia.org/T328472 (JAllemandou) >>! In T328472#9110768, @xcollazo wrote: > I would rephrase the problem as: Why do we need to keep generated artifacts inside our version cont...
[08:55:42] (SystemdUnitFailed) resolved: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:57:29] Data-Platform-SRE, SRE, User-MoritzMuehlenhoff: Configure the Hadoop MapReduce ports to use a fixed range - https://phabricator.wikimedia.org/T111433 (BTullis)
[09:30:57] btullis: I've made our backlog grooming meeting weekly until we catch up a bit
[09:31:37] gehel: Ack, thanks.
[09:48:00] Data-Platform-SRE, Product-Analytics: Allow connections to presto UI port - https://phabricator.wikimedia.org/T331455 (BTullis) I think that there would be a privacy concern if we were to do this. There's no authentication or authorization in the Presto UI (See https://github.com/prestodb/presto/issues/1...
[09:48:21] Data-Platform-SRE, Product-Analytics: Allow connections to presto UI port - https://phabricator.wikimedia.org/T331455 (BTullis) p:Triage→Low
[09:53:32] Data-Platform-SRE: Allow users to differentiate their JupyterHub logs in Logstash - https://phabricator.wikimedia.org/T293243 (BTullis) Open→Declined Having reviewed this, I'm not sure that there is much of a need. We can do any filtering that we need with normal logstash queries.
[09:57:49] I just noticed that the mailing list summary needs updates, it links to dead gmane URLs https://lists.wikimedia.org/postorius/lists/analytics.lists.wikimedia.org/
[10:00:04] awight: Thanks for pointing that out. I think I can update it. We just need to replace postoris with hyperkitty, don't we?
[10:00:14] postorius
[10:07:25] Data-Engineering, Anti-Harassment, Growth-Team, MediaWiki-extensions-EventLogging, and 5 others: [EPIC] Deprecate EventLogging::logEvent() - https://phabricator.wikimedia.org/T318263 (phuedx)
[10:07:31] Data-Platform-SRE, Product-Analytics: Allow connections to presto UI port - https://phabricator.wikimedia.org/T331455 (JAllemandou) Given the proposed solution from Andrew I don't think there'd be more privacy issues than with Hadoop, i.e: cluster users can see others folks jobs. If the presto UI is hidd...
[11:09:42] (SystemdUnitFailed) firing: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:10:26] Data-Platform-SRE, Product-Analytics: Allow connections to presto UI port - https://phabricator.wikimedia.org/T331455 (BTullis) >>! In T331455#9112821, @JAllemandou wrote: > Given the proposed solution from Andrew I don't think there'd be more privacy issues than with Hadoop, i.e: cluster users can see o...
[11:11:35] PROBLEM - Presto Server on an-presto1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[11:11:47] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:13:55] Data-Platform-SRE: Investigate an-presto1002 failures - https://phabricator.wikimedia.org/T344808 (BTullis)
[11:14:44] I have created https://phabricator.wikimedia.org/T344808 to investigate why an-presto1002 keeps alerting about presto-server failing.
[11:19:40] Data-Platform-SRE: Investigate an-presto1002 failures - https://phabricator.wikimedia.org/T344808 (BTullis) p:Triage→High a:BTullis
[11:20:19] RECOVERY - Presto Server on an-presto1002 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[11:20:31] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:21:16] (CR) Gmodena: cirrussearch: add fetch_failure schema (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/854572 (https://phabricator.wikimedia.org/T317609) (owner: DCausse)
[11:21:41] Data-Platform-SRE, Product-Analytics: Allow connections to presto UI port - https://phabricator.wikimedia.org/T331455 (JAllemandou) We have similar issues with yarn, although a bit different since we differentiate prod users, but for regular users, IIRC one can kill another job.
[11:24:37] PROBLEM - Presto Server on an-presto1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[11:24:49] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:29:03] RECOVERY - Presto Server on an-presto1002 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[11:29:15] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:33:12] (SystemdUnitFailed) resolved: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:33:27] Data-Platform-SRE: Investigate an-presto1002 failures - https://phabricator.wikimedia.org/T344808 (BTullis) I couldn't find any indication from `/var/log/syslog` of why the memory spikes occurred. Similarly, there was nothing useful in `dmesg -T`. The process table was dumped each time, showing that it was d...
[11:38:12] (SystemdUnitFailed) firing: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:39:44] btullis: correction, https://lists.wikimedia.org/hyperkitty/list/analytics@lists.wikimedia.org/
[11:39:55] Data-Platform-SRE: Investigate an-presto1002 failures - https://phabricator.wikimedia.org/T344808 (BTullis) It happened again after a fresh boot. ` btullis@an-presto1002:~$ uptime 11:39:12 up 11 min, 1 user, load average: 4.15, 10.39, 5.26 btullis@an-presto1002:~$ systemctl list-units --state failed UN...
[11:40:47] PROBLEM - Presto Server on an-presto1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[11:40:59] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:01] awight: Thanks. I removed all references to gmane and NNTP mode from the list description. Would you say that's ok now?
[11:43:37] btullis: Short and sweet! As an outsider, I would love a sentence about what to expect on this list, eg. "updates on our work, service maintenance notifications, and discussions about new features"...
[11:44:06] (just guessing what might be here since I really am an outsider ;-)
[11:46:49] * btullis awight: Thanks again for your input. I'll look again to improve it.
[11:47:16] Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (Stevemunene) Added the `AUTH_OIDC_CLIENT_AUTHENTICATION_METHOD ` method and retested, the idp seems okay with everything, my user is authenticated and provided a...
[11:49:33] RECOVERY - Presto Server on an-presto1002 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[11:49:47] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:58:12] (SystemdUnitFailed) resolved: (2) druid-historical.service Failed on druid1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:05:31] PROBLEM - Presto Server on an-presto1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[12:05:43] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:20:03] RECOVERY - Presto Server on an-presto1002 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[12:20:17] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:21:21] Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (BTullis) Thanks @Stevemunene - Good work. I can confirm that when I attempt to log into staging.
[12:23:56] Data-Platform-SRE: Investigate an-presto1002 failures - https://phabricator.wikimedia.org/T344808 (BTullis) Tailing the log of the presto-server shows nothing apart from a gap between the process finishing its initialization and the next time it is started. ` btullis@an-presto1002:/var/log/presto$ tail -f ser...
[12:28:27] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) firing: ...
[12:28:27] The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning
[12:33:27] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) resolved: ...
[12:33:27] The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning
[12:36:27] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) firing: ...
[12:36:27] The mw-page-content-change-enrich Flink cluster in eqiad has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning
[12:39:09] gmodena: ^ Anything I can do to help here, or have you got this? I wasn't worried about the codfw one, but then it fired in eqiad too.
[12:40:42] joal: I'd like to go for a reboot of an-launcher1002 today, if possible. Are you OK with that? I can time it between gobblin/refine job runs.
[12:41:27] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) resolved: (2) The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning
[12:55:43] btullis ack. I was deploying the service with joal. Everything is operational in all DCs. Let me check the alerts, because the downtime was shorter than the expected alerting threshold.
[13:06:30] gmodena: Great, thanks.
[13:47:03] Data-Engineering, Data Engineering and Event Platform Team (Sprint 1), Epic, Event-Platform: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 (gmodena)
[14:08:25] Data-Platform-SRE, SRE, decommission-hardware, ops-eqiad: decommission db1108.eqiad.wmnet - https://phabricator.wikimedia.org/T336254 (Jclark-ctr) Open→Resolved
[14:16:59] Data-Platform-SRE, Patch-For-Review: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1117.eqiad.wmnet with OS bullseye executed with errors: - an-worker1117 (**FAIL*...
[14:35:10] Data-Platform-SRE: Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0 - https://phabricator.wikimedia.org/T338057 (BTullis) p:Triage→High
[14:43:42] Data-Platform-SRE: Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0 - https://phabricator.wikimedia.org/T338057 (BTullis) Another update: Spark 3.3.3 has just been released, as of August 21st. https://spark.apache.org/releases/spark-release-3-3-3.html Should I s...
[14:44:14] Heads-up, I'm going to be rebooting an-launcher1002 in 6 minutes' time.
[14:46:34] Data-Engineering, Data Engineering and Event Platform Team, Event-Platform: mw-page-content-change-enrich: filter out events larger than max.request.size - https://phabricator.wikimedia.org/T342399 (gmodena)
[14:47:31] Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (Stevemunene) Found some info on what we might be missing, We have so far verified that authentication on the IDP side is okay and that we do receive a signed id t...
[14:50:12] (SystemdUnitFailed) firing: (8) druid-broker.service Failed on an-druid1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:50:27] !log rebooting an-launcher1002
[14:50:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:55:12] (SystemdUnitFailed) resolved: (10) druid-broker.service Failed on an-druid1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:57:59] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1117.eqiad.wmnet with OS bullseye
[15:01:27] (SystemdUnitFailed) firing: (3) druid-historical.service Failed on an-druid1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:05:12] (SystemdUnitFailed) resolved: (5) druid-historical.service Failed on an-druid1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:06:39] PROBLEM - Host an-druid1004 is DOWN: PING CRITICAL - Packet loss = 100%
[15:06:45] RECOVERY - Host an-druid1004 is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms
[15:07:01] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (Jclark-ctr)
[15:44:37] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1117.eqiad.wmnet with OS bullseye executed with errors: - an-worker1117 (**FAIL**) - Downtimed on Ic...
[15:54:07] I'm going to go for a failover of the Hadoop nameserver from an-master1001 to an-master1002, to facilitate rebooting an-master1001.
[15:59:53] Data-Engineering, Data-Platform-SRE, Data Products, Scap: analytics/refinery: Stop using git-fat - https://phabricator.wikimedia.org/T328472 (xcollazo) > Currently this is done through the refine-repo deployment (on stats machine and on HDFS). Before removing the artifacts from the refine repo o...
[16:01:51] Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (BTullis) In case it helps, I did a little digging into the CAS logs on idp-test1002 and stumbled upon this, which might help. ` root@idp-test1002:/var/log/cas# gr...
[16:04:26] (CR) Clare Ming: "This change is ready for review." (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951560 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[16:37:35] Data-Platform-SRE: Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0 - https://phabricator.wikimedia.org/T338057 (xcollazo) Re 3.3 vs 3.4, I am yet to do any tests on 3.4. But actually, @BTullis , since I suspect that my current blocking issue (T340861#9101939) i...
[16:38:25] Data-Platform-SRE: Investigate an-presto1002 failures - https://phabricator.wikimedia.org/T344808 (BTullis) Open→Resolved It hasn't happened for four hours, so I'm tempted to mark this as resolved, even though we haven't ascertained the root cause. If it happens again, we can reopen it.
[17:42:58] Data-Platform-SRE, Discovery-Search: Create and publish new elastic dev image - https://phabricator.wikimedia.org/T344841 (bking)
[17:44:20] (CR) Phuedx: Experiment with including fragments inside data objects. (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951560 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[17:48:00] Data-Platform-SRE, Discovery-Search: Create and publish new elastic dev image - https://phabricator.wikimedia.org/T344841 (bking) I have a merge request ready to go, but I don't have permission to push. I clicked the "request permission" button in Gitlab, but in the meantime @dcausse 's patch [[ https:/...
[17:57:47] (Abandoned) Clare Ming: Experiment with including fragments inside data objects. [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951560 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[19:01:38] (PS4) Clare Ming: Add Metrics Platform fragments by platform only [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557)
[19:02:22] (CR) Clare Ming: Add Metrics Platform fragments by platform only (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[19:35:29] Data-Engineering, Product-Analytics: Email notifications of new MediaWiki history snapshot availabilty - https://phabricator.wikimedia.org/T344854 (mpopov)
[20:37:38] Data-Platform-SRE, Discovery-Search, Patch-For-Review: Create and publish new elastic dev image - https://phabricator.wikimedia.org/T344841 (CodeReviewBot) bking opened https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/48 elasticsearch: Update wmf-elasticsearch-search-plugins
[20:55:48] PROBLEM - Presto Server on an-presto1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[20:56:40] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:19:48] RECOVERY - Presto Server on an-presto1002 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[21:20:58] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:28:47] (PS5) Clare Ming: Add Metrics Platform fragments by platform only [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557)
[22:48:27] Data-Engineering, Dumps 2.0, Data Products (Sprint 00): Develop Dumps Triage Runbook - https://phabricator.wikimedia.org/T343325 (VirginiaPoundstone) p:Triage→High