[01:16:44] <jinxer-wm>	 (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 4.675% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[02:25:49] <wikibugs>	 Data-Platform-SRE (2023/24 Q2 Milestone 1): Create a superset container image using the PipelineLib framework - https://phabricator.wikimedia.org/T352165 (BTullis) It took a while, but I've got past a real blocker that I had with publishing via the trusted runners. I've now got what I feel is a good vanilla...
[02:27:42] <jinxer-wm>	 (SystemdUnitFailed) firing: cleanup_tmpdumps.service Failed on dumpsdata1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:24:55] <wikibugs>	 Data-Platform-SRE (2023/24 Q2 Milestone 1), Patch-For-Review: Update relforge elasticsearch instance extra plugin - https://phabricator.wikimedia.org/T353270 (CodeReviewBot) hashar merged https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/56  elasticsearch: fix changelog entries
[03:25:01] <wikibugs>	 Data-Platform-SRE (2023/24 Q2 Milestone 1), Patch-For-Review: Update relforge elasticsearch instance extra plugin - https://phabricator.wikimedia.org/T353270 (CodeReviewBot) hashar opened https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/56  elasticsearch: fix changelog entries
[05:16:44] <jinxer-wm>	 (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 4.504% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[06:27:42] <jinxer-wm>	 (SystemdUnitFailed) firing: cleanup_tmpdumps.service Failed on dumpsdata1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:27:50] <wikibugs>	 Quarry, Data-Services, cloud-services-team (FY2023/2024-Q1-Q2): Create db user for Quarry with readonly access to public ToolsDB databases - https://phabricator.wikimedia.org/T348407 (SD0001) Hi @fnegri, did you get a chance to get to this? Thanks!
[08:59:27] <wikibugs>	 Data-Engineering, CommonsMetadata, DiscussionTools, Growth-Team, and 9 others: Phase out Title::getPageViewLanguage in favour of ParserOutput metadata - https://phabricator.wikimedia.org/T350806 (Krinkle) a:xSavitar
[09:01:07] <wikibugs>	 Data-Engineering, CommonsMetadata, DiscussionTools, Growth-Team, and 9 others: Phase out Title::getPageViewLanguage in favour of ParserOutput metadata - https://phabricator.wikimedia.org/T350806 (Krinkle) a:xSavitar→DAlangi_WMF
[09:01:21] <wikibugs>	 Data-Engineering, CommonsMetadata, DiscussionTools, Growth-Team, and 9 others: Phase out Title::getPageViewLanguage in favour of ParserOutput metadata - https://phabricator.wikimedia.org/T350806 (Krinkle)
[09:16:44] <jinxer-wm>	 (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 4.305% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[09:42:53] <klausman>	 joal: just a heads up: I am merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/982761 soon, which will add newer firmware for AMD GPUs for all thus-enabled nodes (it's a bullseye backport for Bullseye machines only), since we need the newer package for the newer GPUs on Lift Wing. Shouldn't break anything for the older Vega GPUs.
[09:43:43] <klausman>	 Correction, https://gerrit.wikimedia.org/r/c/operations/puppet/+/982766
[10:00:03] <wikibugs>	 Data-Platform-SRE (23/24 Q3 Milestone 1): Upgrade Airflow instances to Bullseye - https://phabricator.wikimedia.org/T335261 (Gehel)
[10:00:37] <wikibugs>	 Data-Platform-SRE (23/24 Q3 Milestone 1): Upgrade Stats clients to bullseye - https://phabricator.wikimedia.org/T329360 (Gehel)
[10:00:50] <wikibugs>	 Data-Platform-SRE (23/24 Q3 Milestone 1): Upgrade hadoop master to bullseye - https://phabricator.wikimedia.org/T332573 (Gehel)
[10:01:02] <wikibugs>	 Data-Platform-SRE (23/24 Q3 Milestone 1): Upgrade hadoop standby master to bullseye - https://phabricator.wikimedia.org/T332578 (Gehel)
[10:01:32] <wikibugs>	 Data-Platform-SRE (23/24 Q3 Milestone 1): Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (Gehel)
[10:01:47] <wikibugs>	 Data-Platform-SRE (23/24 Q3 Milestone 1): Bring dbstore1009 into service to replace dbstore1005 - https://phabricator.wikimedia.org/T351924 (Gehel)
[10:16:36] <wikibugs>	 Data-Engineering (Sprint 6), Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (BTullis) Getting there. This is from an SSH port forwarding session in the test cluster.{F41598247,width=50%}  We're currently investigating why only four applications...
[10:24:10] <elukey>	 brouberol: o/
[10:24:45] <elukey>	 I saw that https://phabricator.wikimedia.org/T351816 is closed, are we using the spark-history keytab atm or the analytics one?
[10:25:34] <elukey>	 I am asking since I think it would be good now to test the right spark-history keytab 
[10:25:38] <elukey>	 to make sure it works fine etc..
[10:26:09] <btullis>	 elukey: We're using spark-history ones, I believe. https://phabricator.wikimedia.org/T351816#9364086
[10:26:56] <elukey>	 btullis: ack, I recall that the analytics one was mentioned during a meeting, this is why I am asking
[10:27:43] <jinxer-wm>	 (SystemdUnitFailed) firing: cleanup_tmpdumps.service Failed on dumpsdata1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:32:42] <elukey>	 btullis: the base64 -d of the keytab returns
[10:32:52] <elukey>	 WIKIMEDIAanalytics"spark-history-test.svc.eqiad.wmneteq
[10:33:05] <elukey>	 IIRC there was an issue with access permissions on hdfs
[10:33:14] <elukey>	 spark-history needs to be able to fetch data etc..
[10:33:45] <elukey>	 with superset we added it as a super user IIRC, something that was able to proxy auth permissions 
[10:33:51] <elukey>	 I am not sure if we need something similar
[10:34:16] <elukey>	 but spark history shouldn't run as "analytics" in my opinion
[10:34:37] <btullis>	 Oh yes, I remember the context. You must be right, we were going to add the spark-history posix user. I probably answered too soon.
[10:36:18] <elukey>	 yes yes don't worry I don't want to intrude, I was just curious, there are so many in-flight changes
[10:37:21] <wikibugs>	 Data-Engineering, CirrusSearch, Image-Suggestions, Structured-Data-Backlog, Discovery-Search (Current work): Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2023-11-27 - https://phabricator.wikimedia.org/T353134 (mfoss...
[10:37:39] <btullis>	 I think that the keytabs have been created, but we might still be using analytics for the current test that is running manually. I'll let b.rouberol give you a more definitive answer.
[10:41:08] <brouberol>	 elukey: the ticket you mentioned is closed, but I re-opened https://phabricator.wikimedia.org/T352838#9390972 to make sure we track that we should create the posix spark-history user. I didn't think to re-open the other one as well
[10:41:56] <elukey>	 ack!
[10:42:18] <brouberol>	 To avoid mixing many different things, I left it as is atm to focus on the chart/helmfiles development (all working now thanks to you, jayme and btullis), so this will be back on my plate very soon!
[10:42:53] <elukey>	 yes yes! I wanted to add some info before I will be afk, in profile::hadoop::common there are hadoop.proxyuser.etc..
[10:43:05] <elukey>	 I am not sure if it will be the right road for spark history
[10:43:15] <elukey>	 but the permissions problem will likely resurface
[10:44:21] <elukey>	 (proxyuser allows a specific user to impersonate another one)
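Note on the proxyuser settings mentioned above: Hadoop reads impersonation rules from core-site.xml properties named hadoop.proxyuser.<superuser>.hosts and hadoop.proxyuser.<superuser>.groups. The sketch below is only an illustration of what such a pair of properties looks like; the user, host and group values are hypothetical and are not taken from profile::hadoop::common. The settings only matter when the application itself performs the impersonation.

```python
from xml.sax.saxutils import escape


def proxyuser_properties(superuser, hosts, groups):
    """Build the core-site.xml properties that let `superuser` impersonate others.

    hosts:  hosts the superuser may connect from when impersonating
    groups: groups whose members may be impersonated
    """
    prefix = f"hadoop.proxyuser.{superuser}"
    return {
        f"{prefix}.hosts": ",".join(hosts),
        f"{prefix}.groups": ",".join(groups),
    }


def to_core_site_xml(properties):
    """Render a dict of properties as <property> elements for core-site.xml."""
    lines = []
    for name, value in properties.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(value)}</value>")
        lines.append("  </property>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical values, chosen only for illustration.
    props = proxyuser_properties(
        "spark-history",
        hosts=["host1.example.org"],
        groups=["example-group"],
    )
    print(to_core_site_xml(props))
```

The NameNodes need a restart or configuration refresh to pick up a change like this.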
[11:00:15] <moritzm>	 I'm currently doing a pass over hosts in insetup* roles to identify any Puppet5/Puppet7 gotchas (like insetup roles defaulting to Puppet 7 which then get a role applied which doesn't default to Puppet 7 yet, causing cert issues)
[11:00:28] <moritzm>	 for Data Engineering there's two things I'd do:
[11:01:48] <moritzm>	 1. for stat1010/stat1011 I'll add Hiera host entries to force them to Puppet 7 (they are currently already on Puppet 7 due to the insetup role), but the stat* role isn't yet since we still have buster nodes. Those Hiera host entries will ensure the service setup will continue to use Puppet 7 (we already have stat1009 on Puppet 7)
[11:03:53] <moritzm>	 2. there are new kafka-stretch* hosts, I suppose these will use a dedicated kafka role eventually, like role::kafka::stretch? then I'd create an initial stub role (with what is also provided by the insetup roles, like profile::firewall etc.) and have that default to Puppet 7 via Hiera role settings, also ensuring that the setup sticks with Puppet 7
[11:05:05] <brouberol>	 we tried using proxy user, but apparently spark history itself does not support it
[11:05:14] <moritzm>	 the other insetup hosts are fine, they are for roles which already default to Puppet 7
[11:13:26] <wikibugs>	 Data-Engineering (Sprint 6), Data-Platform-SRE (2023/24 Q2 Milestone 1): Run a spark job in test to make sure the history server can see the job data - https://phabricator.wikimedia.org/T352882 (BTullis) In case it helps, there are some useful resources in [[https://docs.google.com/presentation/d/1P6XPqL...
[11:14:49] <elukey>	 brouberol: setting it and restarting hdfs namenodes etc..?
[11:17:06] <brouberol>	 yep, I tried that for the test analytics cluster. That apparently only works if the application performs impersonation (the `doAs` call in https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Superusers.html)
[11:17:07] <btullis>	 moritzm: Thanks so much. For the stat101[0-1] servers that's great. It looks like the kafka-stretch cluster isn't going ahead, so these servers will likely be repurposed. https://phabricator.wikimedia.org/T340492#9390367 - So they can stay in insetup until we decom->rename->recom them, I think.
[11:18:11] <brouberol>	 elukey: I tested adding some proxy users for the test analytics cluster, ran the rolling-restart cookbook for the namenodes to take the new config into account, ran the spark history server with the proxy user, and I was still getting HDFS permission errors
[11:18:56] <brouberol>	 I have documented that somewhere in Phab, let me have a look
[11:19:54] <moritzm>	 btullis: ack, sounds good
[11:20:05] <brouberol>	 well, apparently, "documented" is a strong word.. https://phabricator.wikimedia.org/T352838#9389837
[11:30:48] <elukey>	 :D
[12:32:56] <wikibugs>	 Data-Platform-SRE (2023/24 Q2 Milestone 1): [airflow] Inserting task notes is not working since upgrade to version 2.7.3 - https://phabricator.wikimedia.org/T352534 (BTullis) >>! In T352534#9396561, @xcollazo wrote: >>Report it as a bug upstream and point out that None should be a valid value for user_id > T...
[12:39:35] <wikibugs>	 Quarry, Data-Services, cloud-services-team (FY2023/2024-Q1-Q2): Create db user for Quarry with readonly access to public ToolsDB databases - https://phabricator.wikimedia.org/T348407 (fnegri) @SD0001 not yet, sorry, too many other things! It's in my to-do list though!
[13:16:45] <jinxer-wm>	 (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 4.299% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[13:27:20] <wikibugs>	 Data-Platform-SRE: Restart Search Platform-owned services for Java 8 / Java 11 security updates - https://phabricator.wikimedia.org/T350703 (MoritzMuehlenhoff) Resolved→Open >>! In T350703#9372167, @bking wrote: > I believe this work is complete. Closing, but please reopen if we missed anything.  The...
[13:47:54] <wikibugs>	 Data-Engineering (Sprint 6), Data-Platform-SRE (2023/24 Q2 Milestone 1): Configure appropriate permissions for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (Ottomata) Q: would it be nicer to our future selves if we made a 'spark' user and keytab rather than a specific 'sp...
[13:57:51] <wikibugs>	 Data-Platform-SRE (2023/24 Q2 Milestone 1): [airflow] Inserting task notes is not working since upgrade to version 2.7.3 - https://phabricator.wikimedia.org/T352534 (BTullis) [x] Created the `admin` user on the analytics instance: ` btullis@an-launcher1002:/srv/airflow-analytics$ sudo -u analytics airflow-an...
[14:04:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[14:24:59] <wikibugs>	 Data-Engineering, Data-Catalog: Emit lineage information about Airflow jobs to DataHub - https://phabricator.wikimedia.org/T312566 (Ottomata) > automatic lineage instrumentation  [...] We can't really use this because our logic is so removed from the airflow DAGs themselves.   QQ, if we do it manually, d...
[14:25:50] <wikibugs>	 Data-Engineering, Data-Catalog: Emit lineage information about Airflow jobs to DataHub - https://phabricator.wikimedia.org/T312566 (Ottomata) Anyway, +1 to this idea, sounds great and not that hard to do!  > I'm in no way suggesting that we slow down on the work that @JAllemandou and @lbowmaker are leadi...
[14:27:57] <jinxer-wm>	 (SystemdUnitFailed) firing: cleanup_tmpdumps.service Failed on dumpsdata1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:29:59] <wikibugs>	 Data-Platform-SRE (2023/24 Q2 Milestone 1), observability: Change data platform-related IRC channels to improve communication - https://phabricator.wikimedia.org/T352783 (Gehel) a:bking
[14:32:09] <wikibugs>	 Data-Platform-SRE (2023/24 Q2 Milestone 1): Restart Search Platform-owned services for Java 8 / Java 11 security updates - https://phabricator.wikimedia.org/T350703 (Gehel)
[14:57:53] <wikibugs>	 Data-Engineering, Data-Catalog: Emit lineage information about Airflow jobs to DataHub - https://phabricator.wikimedia.org/T312566 (Ahoelzl)
[15:09:56] <wikibugs>	 Data-Engineering, CirrusSearch, Image-Suggestions, Structured-Data-Backlog, Discovery-Search (Current work): Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2023-11-27 - https://phabricator.wikimedia.org/T353134 (mfoss...
[15:14:13] <wikibugs>	 Data-Platform-SRE (2023/24 Q2 Milestone 1): [airflow] Inserting task notes is not working since upgrade to version 2.7.3 - https://phabricator.wikimedia.org/T352534 (BTullis) I have sent an email to several related groups of users and I have updated wikitech here: https://wikitech.wikimedia.org/wiki/Data_Eng...
[15:19:28] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[15:21:29] <wikibugs>	 Data-Platform-SRE (2023/24 Q2 Milestone 1), Patch-For-Review: Update relforge elasticsearch instance extra plugin - https://phabricator.wikimedia.org/T353270 (CodeReviewBot) bking opened https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/57  elasticsearch: fix jar hell issue
[15:21:47] <wikibugs>	 Data-Platform-SRE (2023/24 Q2 Milestone 1): [airflow] Inserting task notes is not working since upgrade to version 2.7.3 - https://phabricator.wikimedia.org/T352534 (BTullis) Open→Resolved
[15:24:47] <wikibugs>	 Data-Platform-SRE (2023/24 Q2 Milestone 1), Patch-For-Review: Update relforge elasticsearch instance extra plugin - https://phabricator.wikimedia.org/T353270 (CodeReviewBot) bking merged https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/57  elasticsearch: fix jar hell issue
[15:35:47] <wikibugs>	 Data-Platform-SRE: Automate elastic plugin pkg build process - https://phabricator.wikimedia.org/T303011 (bking) Reopening as:   - I goofed up a package build yesterday - I'm 90% done with the playbook -  Based on my discussion with @BTullis , the current CI-based build pipeline is [[ https://wikitech.wikime...
[15:40:13] <wikibugs>	 Data-Engineering (Sprint 6), Data-Platform-SRE (2023/24 Q2 Milestone 1): Configure appropriate permissions for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (BTullis) p:Triage→High a:brouberol→BTullis >>! In T352838#9402813, @Ottomata wrote: > Q: would it b...
[15:50:32] <wikibugs>	 Data-Platform-SRE (2023/24 Q2 Milestone 1): Update elasticsearch instance extra plugin - https://phabricator.wikimedia.org/T353270 (bking)
[15:56:14] <xcollazo>	 btullis: whenever you get a chance, could you please merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/980923 ?
[15:58:16] <wikibugs>	 Data-Engineering (Sprint 6), Data-Platform-SRE (2023/24 Q2 Milestone 1), Patch-For-Review: Configure appropriate permissions for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (BTullis) I've created this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/982846 to ad...
[16:06:56] <wikibugs>	 Data-Engineering, CirrusSearch, Image-Suggestions, Structured-Data-Backlog, Discovery-Search (Current work): Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2023-11-27 - https://phabricator.wikimedia.org/T353134 (mfoss...
[16:37:20] <wikibugs>	 Data-Platform-SRE (2023/24 Q2 Milestone 1), Patch-For-Review: Create a superset container image using the PipelineLib framework - https://phabricator.wikimedia.org/T352165 (CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/superset/-/merge_requests/22  Add our customisatio...
[16:39:11] <wikibugs>	 Data-Platform-SRE (2023/24 Q2 Milestone 1), Patch-For-Review: Create a superset container image using the PipelineLib framework - https://phabricator.wikimedia.org/T352165 (BTullis) I've made a good start with this, adding kerberos and memcached support to the superset container. Marking as ready for rev...
[16:49:20] <wikibugs>	 Data-Platform-SRE (2023/24 Q2 Milestone 1): Check home/HDFS leftovers of ntsako - https://phabricator.wikimedia.org/T343189 (BTullis) Open→Resolved I have received approval from @WDoranWMF to remove the files:  Dropping hive databases. ` hive (default)> DROP DATABASE ntsako CASCADE; OK Time taken: 4....
[17:11:48] <wikibugs>	 Analytics-Kanban, Data-Engineering, Pontoon: Move the Analytics/DE testing infrastructure to Pontoon - https://phabricator.wikimedia.org/T292388 (BTullis) Open→Declined I'm declining this task, as we haven't invested any more time into pontoon recently and seem unlikely to do so in the near f...
[17:16:45] <jinxer-wm>	 (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 4.214% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[17:53:08] <wikibugs>	 Data-Engineering, Data-Platform-SRE: Update the sqoop configuration for mediawiki to obtain linktarget from the production replicas, instead of wikireplicas - https://phabricator.wikimedia.org/T352879 (Ladsgroup) Open→Resolved This is fixed. Thanks Dan for finding the underlying issue.
[18:24:21] <wikibugs>	 Data-Engineering, CommonsMetadata, DiscussionTools, Growth-Team, and 10 others: Phase out Title::getPageViewLanguage in favour of ParserOutput metadata - https://phabricator.wikimedia.org/T350806 (matmarex) >>! In T350806#9401468, @Tgr wrote: > I think the more correct approach would be some kind...
[18:27:58] <jinxer-wm>	 (SystemdUnitFailed) firing: cleanup_tmpdumps.service Failed on dumpsdata1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:10:38] <wikibugs>	 Data-Engineering (Sprint 6), Data-Platform-SRE (2023/24 Q2 Milestone 1), Patch-For-Review: Configure the YARN resource manager with the spark history service URL - https://phabricator.wikimedia.org/T352863 (BTullis) This is proving a little tricky to test on the hadoop-test cluster, because we don't...
[19:12:06] <wikibugs>	 (PS2) Milimetric: Remove grouping by unpredictable country name [analytics/refinery] - https://gerrit.wikimedia.org/r/982899 (https://phabricator.wikimedia.org/T353296)
[21:16:45] <jinxer-wm>	 (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 3.995% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[21:25:43] <wikibugs>	 (PS1) Ottomata: Idea: Separate Iceberg writing logic from deequ -> wmf metrics [analytics/refinery/source] - https://gerrit.wikimedia.org/r/982919
[21:26:59] <wikibugs>	 Data-Engineering, CirrusSearch, Image-Suggestions, Structured-Data-Backlog, Discovery-Search (Current work): Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2023-11-27 - https://phabricator.wikimedia.org/T353134 (dcaus...
[22:27:58] <jinxer-wm>	 (SystemdUnitFailed) firing: cleanup_tmpdumps.service Failed on dumpsdata1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:34:00] <wikibugs>	 Data-Platform-SRE: Ensure Elastic stack works on bookworm - https://phabricator.wikimedia.org/T353392 (bking)
[22:34:41] <wikibugs>	 Data-Platform-SRE: Ensure Elastic stack works on bookworm - https://phabricator.wikimedia.org/T353392 (bking)
[22:38:26] <wikibugs>	 Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (RKemper) Been seeing some weirdness on `elastic1107` (internal search team alerts for `PuppetZeroResources` and the like) so we'll see if a fresh reimage smooths things over
[22:38:29] <wikibugs>	 Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm
[22:38:55] <wikibugs>	 Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm executed with errors: - elastic1...
[22:45:25] <wikibugs>	 Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm
[22:45:28] <wikibugs>	 Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm executed with errors: - elastic1...
[22:48:02] <wikibugs>	 Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm
[22:50:39] <wikibugs>	 Data-Platform-SRE: Rolling operation cookbook: Detect and remove failed index aliases - https://phabricator.wikimedia.org/T345449 (RKemper)
[22:52:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[23:11:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 90% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[23:21:27] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 90% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[23:22:00] <wikibugs>	 Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm completed: - elastic1107 (**PASS...
[23:27:27] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[23:44:56] <wikibugs>	 Data-Engineering, Data-Platform-SRE, SRE, observability, and 3 others: Upgrade Kafka to from 1.x to later version - https://phabricator.wikimedia.org/T300102 (odimitrijevic) Thank you @elukey!