[01:16:44] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 4.675% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [02:25:49] 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Create a superset container image using the PipelineLib framework - https://phabricator.wikimedia.org/T352165 (10BTullis) It took a while, but I've got past a real blocker that I had with publishing via the trusted runners. I've now got what I feel is a good vanilla... [02:27:42] (SystemdUnitFailed) firing: cleanup_tmpdumps.service Failed on dumpsdata1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:24:55] 10Data-Platform-SRE (2023/24 Q2 Milestone 1), 10Patch-For-Review: Update relforge elasticsearch instance extra plugin - https://phabricator.wikimedia.org/T353270 (10CodeReviewBot) hashar merged https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/56 elasticsearch: fix changelog entries [03:25:01] 10Data-Platform-SRE (2023/24 Q2 Milestone 1), 10Patch-For-Review: Update relforge elasticsearch instance extra plugin - https://phabricator.wikimedia.org/T353270 (10CodeReviewBot) hashar opened https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/56 elasticsearch: fix changelog entries [05:16:44] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 4.504% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [06:27:42] (SystemdUnitFailed) firing: cleanup_tmpdumps.service Failed on dumpsdata1003:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:27:50] 10Quarry, 10Data-Services, 10cloud-services-team (FY2023/2024-Q1-Q2): Create db user for Quarry with readonly access to public ToolsDB databases - https://phabricator.wikimedia.org/T348407 (10SD0001) Hi @fnegri, did you get a chance to get to this? Thanks! [08:59:27] 10Data-Engineering, 10CommonsMetadata, 10DiscussionTools, 10Growth-Team, and 9 others: Phase out Title::getPageViewLanguage in favour of ParserOutput metadata - https://phabricator.wikimedia.org/T350806 (10Krinkle) a:03xSavitar [09:01:07] 10Data-Engineering, 10CommonsMetadata, 10DiscussionTools, 10Growth-Team, and 9 others: Phase out Title::getPageViewLanguage in favour of ParserOutput metadata - https://phabricator.wikimedia.org/T350806 (10Krinkle) a:05xSavitar→03DAlangi_WMF [09:01:21] 10Data-Engineering, 10CommonsMetadata, 10DiscussionTools, 10Growth-Team, and 9 others: Phase out Title::getPageViewLanguage in favour of ParserOutput metadata - https://phabricator.wikimedia.org/T350806 (10Krinkle) [09:16:44] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 4.305% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [09:42:53] joal: just a heads up: I am merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/982761 soon, which will add newer firmware for AMD GPUs for all thus-enabled nodes (it's a bullseye backport for Bullseye machines only), since we need the newer package for the newer GPUs on Lift Wing. Shouldn't break anything for the older Vega GPUs. 
[09:43:43] Correction, https://gerrit.wikimedia.org/r/c/operations/puppet/+/982766 [10:00:03] 10Data-Platform-SRE (23/24 Q3 Milestone 1): Upgrade Airflow instances to Bullseye - https://phabricator.wikimedia.org/T335261 (10Gehel) [10:00:37] 10Data-Platform-SRE (23/24 Q3 Milestone 1): Upgrade Stats clients to bullseye - https://phabricator.wikimedia.org/T329360 (10Gehel) [10:00:50] 10Data-Platform-SRE (23/24 Q3 Milestone 1): Upgrade hadoop master to bullseye - https://phabricator.wikimedia.org/T332573 (10Gehel) [10:01:02] 10Data-Platform-SRE (23/24 Q3 Milestone 1): Upgrade hadoop standby master to bullseye - https://phabricator.wikimedia.org/T332578 (10Gehel) [10:01:32] 10Data-Platform-SRE (23/24 Q3 Milestone 1): Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (10Gehel) [10:01:47] 10Data-Platform-SRE (23/24 Q3 Milestone 1): Bring dbstore1009 into service to replace dbstore1005 - https://phabricator.wikimedia.org/T351924 (10Gehel) [10:16:36] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10BTullis) Getting there. This is from an SSH port forwarding session in the test cluster.{F41598247,width=50%} We're currently investigating who only four applications... [10:24:10] brouberol: o/ [10:24:45] I saw that https://phabricator.wikimedia.org/T351816 is closed, are we using the spark-history keytab atm or the analytics one? [10:25:34] I am asking since I think it would be good now to test the right spark-history keytab [10:25:38] to make sure it works fine etc.. [10:26:09] elukey: We're using spark-history ones, I believe. 
https://phabricator.wikimedia.org/T351816#9364086 [10:26:56] btullis: ack, I recall that the analytics one was mentioned during a meeting, this is why I am asking [10:27:43] (SystemdUnitFailed) firing: cleanup_tmpdumps.service Failed on dumpsdata1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:32:42] btullis: the base64 -d of the keytab returns [10:32:52] WIKIMEDIAanalytics"spark-history-test.svc.eqiad.wmneteq [10:33:05] IIRC there was an issue with access permissions on hdfs [10:33:14] spark-history needs to be able to fetch data etc.. [10:33:45] with superset we added it as a super user IIRC, something that was able to proxy auth permissions [10:33:51] I am not sure if we need something similar [10:34:16] but spark history shouldn't run as "analytics" in my opinion [10:34:37] Oh yes, I remember the context. You must be right, we were going to add the spark-history posix user. I probably answered too soon. [10:36:18] yes yes don't worry I don't want to intrude, I was just curious, there are so many in-flight changes [10:37:21] 10Data-Engineering, 10CirrusSearch, 10Image-Suggestions, 10Structured-Data-Backlog, 10Discovery-Search (Current work): Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2023-11-27 - https://phabricator.wikimedia.org/T353134 (10mfoss... [10:37:39] I think that the keytabs have been created, but we might still be using analytics for the current test that is running manually. I'll let b.rouberol give you a more definitive answer. [10:41:08] elukey: the ticket you mentioned is closed, but I re-opened https://phabricator.wikimedia.org/T352838#9390972 to make sure we track that we should create the posix spark-history user. I didn't think to re-open the other one as well [10:41:56] ack! 
[10:42:18] To avoid mixing many different things, I left it as is atm to focus on the chart/helmfiles development (all working now thanks to you, jayme and btullis), so this will be back on my plate very soon! [10:42:53] yes yes! I wanted to add some info before I will be afk, in profile::hadoop::common there are hadoop.proxyuser.etc.. [10:43:05] I am not sure if it will be the right road for spark history [10:43:15] but the permissions problem will likely resurface [10:44:21] (proxyuser allows a specific user to impersonate another one) [11:00:15] I'm currently doing a pass over hosts in insetup* roles to identify any Puppet5/Puppet7 gotchas (like insetup roles defaulting to Puppet 7 which then get a role applied which doesn't default to Puppet 7 yet, causing cert issues) [11:00:28] for Data Engineering there's two things I'd do: [11:01:48] 1. for stat1010/stat1011 I'll add Hiera host entries to force them to Puppet 7 (they are currently already on Puppet 7 due to the insetup role), but the stat* role isn't yet since we still have buster nodes. Those Hiera host entries will ensure the service setup will continue to use Puppet 7 (we already have stat1009 on Puppet 7) [11:03:53] 2. there are new kafka-stretch* hosts, I suppose these will use a dedicated kafka role eventually, like role::kafka::stretch? 
then I'd create an initial stub role (with what is also provided by the insetup roles, like profile::firewall etc.) and have that default to Puppet 7 via Hiera role settings, also ensuring that the setup sticks with Puppet 7 [11:05:05] we tried using proxy user, but apparently spark history itself does not support it [11:05:14] the other insetup hosts are fine, they are for roles which already default to Puppet 7 [11:13:26] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Run a spark job in test to make sure the history server can see the job data - https://phabricator.wikimedia.org/T352882 (10BTullis) In case it helps, there are some useful resources in [[https://docs.google.com/presentation/d/1P6XPqL... [11:14:49] brouberol: setting it and restarting hdfs namenodes etc..? [11:17:06] yep, I tried that for the test analytics cluster. That apparently only works if the application performs impersonation (the `doAs` call in https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Superusers.html) [11:17:07] moritzm: Thanks so much. For the stat101[0-1] servers that's great. It looks like the kafka-stretch cluster isn't going ahead, so these servers will likely be repurposed. https://phabricator.wikimedia.org/T340492#9390367 - So they can stay in insetup until we decom->rename->recom them, I think. [11:18:11] elukey: I tested adding some proxy users for the test analytics cluster, ran the rolling-restart cookbook for the namenodes, to take the new config into account, ran the spark history server with the proxy user, and I was still getting HDFS permission errors [11:18:56] I have documented that somewhere in Phab, let me have a look [11:19:54] btullis: ack, sounds good [11:20:05] well, apparently, "documented" is a strong word.. 
https://phabricator.wikimedia.org/T352838#9389837 [11:30:48] :D [12:32:56] 10Data-Platform-SRE (2023/24 Q2 Milestone 1): [airflow] Inserting task notes is not working since upgrade to version 2.7.3 - https://phabricator.wikimedia.org/T352534 (10BTullis) >>! In T352534#9396561, @xcollazo wrote: >>Report it as a bug upstream and point out that None should be a valid value for user_id > T... [12:39:35] 10Quarry, 10Data-Services, 10cloud-services-team (FY2023/2024-Q1-Q2): Create db user for Quarry with readonly access to public ToolsDB databases - https://phabricator.wikimedia.org/T348407 (10fnegri) @SD0001 not yet, sorry, too many other things! It's in my to-do list though! [13:16:45] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 4.299% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [13:27:20] 10Data-Platform-SRE: Restart Search Platform-owned services for Java 8 / Java 11 security updates - https://phabricator.wikimedia.org/T350703 (10MoritzMuehlenhoff) 05Resolved→03Open >>! In T350703#9372167, @bking wrote: > I believe this work is complete. Closing, but please reopen if we missed anything. The... [13:47:54] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Configure appropriate permissions for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (10Ottomata) Q: would it be nicer to our future selves if we made a 'spark' user and keytab rather than a specific 'sp... [13:57:51] 10Data-Platform-SRE (2023/24 Q2 Milestone 1): [airflow] Inserting task notes is not working since upgrade to version 2.7.3 - https://phabricator.wikimedia.org/T352534 (10BTullis) [x] Created the `admin` user on the analytics instance: ` btullis@an-launcher1002:/srv/airflow-analytics$ sudo -u analytics airflow-an... 
[14:04:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [14:24:59] 10Data-Engineering, 10Data-Catalog: Emit lineage information about Airflow jobs to DataHub - https://phabricator.wikimedia.org/T312566 (10Ottomata) > automatic lineage instrumentation [...] We can't really use this because our logic is so removed from the airflow DAGs themselves. QQ, if we do it manually, d... [14:25:50] 10Data-Engineering, 10Data-Catalog: Emit lineage information about Airflow jobs to DataHub - https://phabricator.wikimedia.org/T312566 (10Ottomata) Anyway, +1 to this idea, sounds great and not that hard to do! > I'm in no way suggesting that we slow down on the work that @JAllemandou and @lbowmaker are leadi... 
[14:27:57] (SystemdUnitFailed) firing: cleanup_tmpdumps.service Failed on dumpsdata1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:29:59] 10Data-Platform-SRE (2023/24 Q2 Milestone 1), 10observability: Change data platform-related IRC channels to improve communication - https://phabricator.wikimedia.org/T352783 (10Gehel) a:03bking [14:32:09] 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Restart Search Platform-owned services for Java 8 / Java 11 security updates - https://phabricator.wikimedia.org/T350703 (10Gehel) [14:57:53] 10Data-Engineering, 10Data-Catalog: Emit lineage information about Airflow jobs to DataHub - https://phabricator.wikimedia.org/T312566 (10Ahoelzl) [15:09:56] 10Data-Engineering, 10CirrusSearch, 10Image-Suggestions, 10Structured-Data-Backlog, 10Discovery-Search (Current work): Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2023-11-27 - https://phabricator.wikimedia.org/T353134 (10mfoss... [15:14:13] 10Data-Platform-SRE (2023/24 Q2 Milestone 1): [airflow] Inserting task notes is not working since upgrade to version 2.7.3 - https://phabricator.wikimedia.org/T352534 (10BTullis) I have sent an email to several related groups of users and I have updated wikitech here: https://wikitech.wikimedia.org/wiki/Data_Eng... 
[15:19:28] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [15:21:29] 10Data-Platform-SRE (2023/24 Q2 Milestone 1), 10Patch-For-Review: Update relforge elasticsearch instance extra plugin - https://phabricator.wikimedia.org/T353270 (10CodeReviewBot) bking opened https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/57 elasticsearch: fix jar hell issue [15:21:47] 10Data-Platform-SRE (2023/24 Q2 Milestone 1): [airflow] Inserting task notes is not working since upgrade to version 2.7.3 - https://phabricator.wikimedia.org/T352534 (10BTullis) 05Open→03Resolved [15:24:47] 10Data-Platform-SRE (2023/24 Q2 Milestone 1), 10Patch-For-Review: Update relforge elasticsearch instance extra plugin - https://phabricator.wikimedia.org/T353270 (10CodeReviewBot) bking merged https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/57 elasticsearch: fix jar hell issue [15:35:47] 10Data-Platform-SRE: Automate elastic plugin pkg build process - https://phabricator.wikimedia.org/T303011 (10bking) Reopening as: - I goofed up a package build yesterday - I'm 90% done with the playbook - Based on my discussion with @BTullis , the current CI-based build pipeline is [[ https://wikitech.wikime... [15:40:13] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Configure appropriate permissions for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (10BTullis) p:05Triage→03High a:05brouberol→03BTullis >>! In T352838#9402813, @Ottomata wrote: > Q: would it b... 
[15:50:32] 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Update elasticsearch instance extra plugin - https://phabricator.wikimedia.org/T353270 (10bking) [15:56:14] btullis: whenever you get a chance, could you please merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/980923 ? [15:58:16] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE (2023/24 Q2 Milestone 1), 10Patch-For-Review: Configure appropriate permissions for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (10BTullis) I've created this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/982846 to ad... [16:06:56] 10Data-Engineering, 10CirrusSearch, 10Image-Suggestions, 10Structured-Data-Backlog, 10Discovery-Search (Current work): Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2023-11-27 - https://phabricator.wikimedia.org/T353134 (10mfoss... [16:37:20] 10Data-Platform-SRE (2023/24 Q2 Milestone 1), 10Patch-For-Review: Create a superset container image using the PipelineLib framework - https://phabricator.wikimedia.org/T352165 (10CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/superset/-/merge_requests/22 Add our customisatio... [16:39:11] 10Data-Platform-SRE (2023/24 Q2 Milestone 1), 10Patch-For-Review: Create a superset container image using the PipelineLib framework - https://phabricator.wikimedia.org/T352165 (10BTullis) I've made a good start with this, adding kerberos and memcached support to the superset container. Marking as ready for rev... [16:49:20] 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Check home/HDFS leftovers of ntsako - https://phabricator.wikimedia.org/T343189 (10BTullis) 05Open→03Resolved I have received approval from @WDoranWMF to remove the files: Dropping hive databases. ` hive (default)> DROP DATABASE ntsako CASCADE; OK Time taken: 4.... 
[17:11:48] 10Analytics-Kanban, 10Data-Engineering, 10Pontoon: Move the Analytics/DE testing infrastructure to Pontoon - https://phabricator.wikimedia.org/T292388 (10BTullis) 05Open→03Declined I'm declining this task, as we haven't invested any more time into pontoon recently and seem unlikely to do so in the near f... [17:16:45] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 4.214% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [17:53:08] 10Data-Engineering, 10Data-Platform-SRE: Update the sqoop configuration for mediawiki to obtain linktarget from the production replicas, instead of wikireplicas - https://phabricator.wikimedia.org/T352879 (10Ladsgroup) 05Open→03Resolved This is fixed. Thanks Dan for finding the underlying issue. [18:24:21] 10Data-Engineering, 10CommonsMetadata, 10DiscussionTools, 10Growth-Team, and 10 others: Phase out Title::getPageViewLanguage in favour of ParserOutput metadata - https://phabricator.wikimedia.org/T350806 (10matmarex) >>! In T350806#9401468, @Tgr wrote: > I think the more correct approach would be some kind... [18:27:58] (SystemdUnitFailed) firing: cleanup_tmpdumps.service Failed on dumpsdata1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:10:38] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE (2023/24 Q2 Milestone 1), 10Patch-For-Review: Configure the YARN resource manager with the spark history service URL - https://phabricator.wikimedia.org/T352863 (10BTullis) This is proving a little tricky to test on the hadoop-test cluster, because we don't... 
[19:12:06] (03PS2) 10Milimetric: Remove grouping by unpredictable country name [analytics/refinery] - 10https://gerrit.wikimedia.org/r/982899 (https://phabricator.wikimedia.org/T353296) [21:16:45] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 3.995% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [21:25:43] (03PS1) 10Ottomata: Idea: Separate Iceberg writing logic from deequ -> wmf metrics [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/982919 [21:26:59] 10Data-Engineering, 10CirrusSearch, 10Image-Suggestions, 10Structured-Data-Backlog, 10Discovery-Search (Current work): Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2023-11-27 - https://phabricator.wikimedia.org/T353134 (10dcaus... [22:27:58] (SystemdUnitFailed) firing: cleanup_tmpdumps.service Failed on dumpsdata1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:34:00] 10Data-Platform-SRE: Ensure Elastic stack works on bookworm - https://phabricator.wikimedia.org/T353392 (10bking) [22:34:41] 10Data-Platform-SRE: Ensure Elastic stack works on bookworm - https://phabricator.wikimedia.org/T353392 (10bking) [22:38:26] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (10RKemper) Been seeing some weirdness on `elastic1107` (internal search team alerts for `PuppetZeroResources` and the like) so we'll see if a fresh reimage smooths things over [22:38:29] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (10ops-monitoring-bot) 
Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm [22:38:55] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm executed with errors: - elastic1... [22:45:25] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm [22:45:28] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm executed with errors: - elastic1... 
[22:48:02] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm [22:50:39] 10Data-Platform-SRE: Rolling operation cookbook: Detect and remove failed index aliases - https://phabricator.wikimedia.org/T345449 (10RKemper) [22:52:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [23:11:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 90% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [23:21:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 90% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [23:22:00] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm completed: - elastic1107 (**PASS... 
[23:27:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [23:44:56] 10Data-Engineering, 10Data-Platform-SRE, 10SRE, 10observability, and 3 others: Upgrade Kafka to from 1.x to later version - https://phabricator.wikimedia.org/T300102 (10odimitrijevic) Thank you @elukey!