[00:07:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:10:12] <wikibugs>	 (CR) Cooltey: [C: +1] New Event schema for mobile apps [schemas/event/secondary] - https://gerrit.wikimedia.org/r/960100 (owner: Sharvaniharan)
[00:12:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:18:49] <wikibugs>	 (CR) Shay Nowick: [C: +2] New Event schema for mobile apps (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/960100 (owner: Sharvaniharan)
[00:18:54] <icinga-wm>	 RECOVERY - Druid overlord on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[00:19:49] <wikibugs>	 (Merged) jenkins-bot: New Event schema for mobile apps [schemas/event/secondary] - https://gerrit.wikimedia.org/r/960100 (owner: Sharvaniharan)
[00:22:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:42:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:52:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:12:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:21:43] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team, MediaWiki-libs-HTTP, Beta-Cluster-reproducible, and 6 others: PHP Warning: curl_multi_remove_handle(): supplied resource is not a valid cURL Multi Handle resource - https://phabricator.wikimedia.org/T288624 (tstarling) Open→Resolv...
[01:22:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:42:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:52:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:07:44] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:12:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:22:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:27:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:35:06] <jinxer-wm>	 (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[02:37:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:42:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:45:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[02:52:44] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:12:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:22:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:37:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:42:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:52:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:12:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:22:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:37:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:42:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:52:44] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:12:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:22:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:42:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:52:30] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1bb0fee0-dd0c-4165-86e2-b81abeffa7d2) set by stevemunene@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their services with...
[06:00:27] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[06:17:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:18:58] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:31:16] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:32:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:35:06] <jinxer-wm>	 (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[07:35:28] <joal>	 stevemunene: Heya - it'd be great if you could silence alerts when reimaging or initializing hosts - druid1009 is spamming us badly :)
[07:38:01] <stevemunene>	 o/ Apologies for the spam, joal. I have already run the sre.hosts.downtime cookbook for druid1009, so we shouldn't expect any more spam
[07:38:14] <joal>	 ack stevemunene - thanks for that
[08:02:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:03:59] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:15:35] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:17:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:31:49] <wikibugs>	 (PS1) Gerrit maintenance bot: Add fon.wikipedia to pageview allowlist [analytics/refinery] - https://gerrit.wikimedia.org/r/962234 (https://phabricator.wikimedia.org/T347939)
[08:32:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:32:57] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:34:15] <wikibugs>	 Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (brouberol) Now that the incident has been mitigated, I'll resume reassigning partitions away from the kafka-jumbo100[1-6] brokers.   ` brouberol@kafka-jumbo1010:~/topicmappr$ topicmappr rebuild  --topics '^(-l|-L...
[08:40:41] <wikibugs>	 Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (brouberol) ` brouberol@kafka-jumbo1010:~/topicmappr$ kafka reassign-partitions --reassignment-json-file small-or-empty.json --verify kafka-reassign-partitions --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet...
[08:45:29] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:47:44] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:57:47] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team, EventStreams, Event-Platform: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (elukey) >>! In T347477#9216774, @Ottomata wrote: > WOW thank you Luca!  Happy to help! I wanted to understand how to...
[08:58:05] <wikibugs>	 Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (brouberol) Next batch of reassignment: ` $ topicmappr rebuild  --topics '^(edisa.mediawiki.job.xxx|edisa.mediawiki.jobrefreshLinks|elukey_druid_test|ema_test_ats|eqcodfw.mediawiki.revision-create|eqcodfw.rc1.medi...
[08:59:27] <wikibugs>	 (PS1) DCausse: rdf_streaming_updater: add emitter_id to side outputs [schemas/event/secondary] - https://gerrit.wikimedia.org/r/963006 (https://phabricator.wikimedia.org/T347515)
[09:30:14] <brouberol>	 btullis: I'm not 100% sure about the leader imbalance metric, but as you can see here https://grafana-rw.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=thanos&var-kafka_cluster=jumbo-eqiad&var-cluster=kafka_jumbo&var-kafka_broker=All&var-disk_device=All&forceLogin=&from=now-7d&to=now&viewPanel=20 the leadership imbalance between all brokers
[09:30:14] <brouberol>	 is being reduced after each reassignment
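One way to put a number on the leadership imbalance discussed above is to count how many partitions each broker leads and compare the spread to the mean. The sketch below is only an illustration: the metric definition, the function name and the broker IDs are assumptions, not the formula behind the Grafana panel linked in the message.

```python
from collections import Counter

def leader_imbalance(partition_leaders, brokers):
    """Spread of leader counts across brokers, relative to the mean.

    partition_leaders maps (topic, partition) -> leader broker ID.
    Returns 0.0 when every broker leads the same number of partitions.
    This definition is an assumption made for illustration.
    """
    counts = Counter(partition_leaders.values())
    per_broker = [counts.get(b, 0) for b in brokers]
    mean = sum(per_broker) / len(per_broker)
    if mean == 0:
        return 0.0
    return (max(per_broker) - min(per_broker)) / mean

# Hypothetical broker IDs and topic name, before and after a reassignment.
brokers = [1001, 1007]
before = {("t", 0): 1001, ("t", 1): 1001, ("t", 2): 1001, ("t", 3): 1007}
after = {("t", 0): 1001, ("t", 1): 1001, ("t", 2): 1007, ("t", 3): 1007}

print(leader_imbalance(before, brokers))
print(leader_imbalance(after, brokers))
```

With this definition, moving leadership from an overloaded broker to an underloaded one brings the value toward zero, which matches the trend described in the message.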
[09:32:32] <joal>	 \o/
[09:33:28] <joal>	 brouberol: I need to introduce you to ottomata - he's back from paternity leave and is our kafka go-to person (before you were there :) - I'm sure you'll have things to talk about :)
[09:34:35] <brouberol>	 oh for sure! I'll book some time
[09:35:00] <joal>	 brouberol: on a different topic - we think we have identified why our flink app was not backfilling at a rate we're happy with, and we managed to overcome the issue by adding more parallelization - The not-yet-understood issue is why it broke in the first place, not managing to write to kafka
[09:36:21] <brouberol>	 indeed. This had me stumped as well. If you want, we could try to set up a second app writing to a topic with a small enough retention so that it is a couple of GB, and we then move it around to try to reproduce
[09:36:47] <brouberol>	 but for now, either we haven't had the epiphany idea, or we don't have a metric on the actual bottleneck
[09:37:31] <brouberol>	 do you have metrics on the size of the batch you're attempting to produce? My spidey sense tells me that somehow, this is related to message size (I could 100% be wrong)
[09:37:36] <joal>	 let's talk about what we wish to do with the rest of the team before deciding - the trials are not cheap unfortunately and we have many other things to do :)
[09:37:54] <brouberol>	 for sure 
[09:38:21] <joal>	 I don't have this information in mind (batch size), but the weird thing is that the app started working after the partition-reassignment was done
[09:44:42] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team (Sprint 3): Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (JAllemandou) Ping @BTullis on this - Could we get a prioritization on either a presto version bump or an adaptation of our version...
[09:48:57] <wikibugs>	 Data-Platform-SRE, decommission-hardware: decommission an-test-client1001.eqiad.wmnet - https://phabricator.wikimedia.org/T343520 (Stevemunene)
[09:49:18] <wikibugs>	 (CR) Joal: [V: +2 C: +2] "Merging for next deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/962234 (https://phabricator.wikimedia.org/T347939) (owner: Gerrit maintenance bot)
[09:49:31] <wikibugs>	 Data-Platform-SRE, decommission-hardware: decommission an-test-client1001.eqiad.wmnet - https://phabricator.wikimedia.org/T343520 (Stevemunene) Open→Resolved
[09:49:39] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (Stevemunene)
[09:50:09] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team (Sprint 3): Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (BTullis) Yes, will do. {T342343} is already in our prioritized backlog, so I'm hoping to get version 0.283 of presto deployed v...
[09:51:27] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team (Sprint 3): Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (JAllemandou) >>! In T266641#9219175, @BTullis wrote: > Yes, will do. {T342343} is already in our prioritized backlog, so I'm ho...
[09:54:18] <dcausse>	 fyi I just resumed our flink test job running from dse k8s (reading and writing to kafka-jumber)
[09:54:26] <dcausse>	 s/jumber/jumbo
[09:56:57] <joal>	 ack dcausse - does it have a dedicated topic to write its reconciliation to?
[09:58:47] <dcausse>	 joal: kind of, added a quick option to disable producing to these topics (working on a more generic solution in the meantime)
[09:59:33] <joal>	 ack dcausse - no problem at all, just trying to keep an up-to-date mental model :)
[10:25:29] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (Stevemunene)
[10:35:06] <jinxer-wm>	 (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[10:36:00] <btullis>	 ^ I will bump the value in the alert for this check. I increased the heap by 4GB, but haven't modified the threshold in the check yet.
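The message above describes a check whose threshold was derived from the old heap size and not yet updated. A minimal sketch of why the alert keeps firing until the threshold is bumped follows. The files-per-GB ratio is a commonly quoted rule of thumb used here as an assumption, and the heap sizes and file count are hypothetical; none of these values come from the real check.

```python
# Assumption: roughly one million files per GB of NameNode heap.
# This is a rule of thumb for illustration, not the real check's value.
FILES_PER_GB_HEAP = 1_000_000

def threshold_for_heap(heap_gb, files_per_gb=FILES_PER_GB_HEAP):
    """File count that a heap of the given size is assumed to support."""
    return heap_gb * files_per_gb

def alert_firing(total_files, threshold):
    """The alert fires when the file count exceeds the configured threshold."""
    return total_files > threshold

# Hypothetical numbers: the heap grew by 4 GB but the check still uses
# the threshold computed for the old heap.
old_heap_gb, new_heap_gb = 88, 92
total_files = 90_000_000

stale_threshold = threshold_for_heap(old_heap_gb)
updated_threshold = threshold_for_heap(new_heap_gb)

print(alert_firing(total_files, stale_threshold))
print(alert_firing(total_files, updated_threshold))
```

Under these assumptions the alert fires against the stale threshold and clears once the threshold reflects the larger heap.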
[10:38:34] <wikibugs>	 Data-Platform-SRE, Dumps-Generation, cloud-services-team: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (BTullis) Here's the full list of IP addresses that are being modified. ` (base) btullis@marlin:~/tmp$ cat clouddumps1...
[11:35:52] <wikibugs>	 Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (brouberol) Next batch ` brouberol@kafka-jumbo1010:~/topicmappr$ topicmappr rebuild  --topics '^(eqiad.android.image_recommendation_event|eqiad.android.image_recommendation_interaction|eqiad.android.install_referrer...
[11:38:55] <joal>	 Hi btullis - for people to be able to ssh deployment.eqiad.wmnet, they need to be in the analytics-deploy group, right?
[11:40:34] <joal>	 We have Surbhi Gupta now doing ops-week, who appears not to have the right access :( (user sg912)
[11:40:55] <joal>	 Is this change something we can do quickly, or do you need a formal demand btullis ?
[11:44:22] <wikibugs>	 Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (brouberol) Next batch  ` brouberol@kafka-jumbo1010:~/topicmappr$ topicmappr rebuild  --topics '^(eqiad.change-prop.retry.mediawiki.job.flaggedrevs_CacheUpdate|eqiad.change-prop.retry.mediawiki.job.htmlCacheUpdate...
[11:58:15] <wikibugs>	 Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (brouberol) Next batch ` topicmappr rebuild  --topics '^(eqiad.changeprop.retry.mediawiki.revision-create|eqiad.changeprop.retry.mediawiki.revision-visibility-change|eqiad.changeprop.retry.resource_change|eqiad.ci...
[12:03:35] <btullis>	 joal: Oh, sorry. I've just seen your message.
[12:04:39] <joal>	 np btullis 
[12:04:47] <joal>	 so, how shall we proceed, btullis?
[12:05:20] <btullis>	 sgupta would have to be in any one of these groups to ssh to a deployment server. https://github.com/wikimedia/operations-puppet/blob/production/hieradata/role/common/deployment_server/kubernetes.yaml#L8-L28
[12:06:25] <wikibugs>	 Data-Engineering, Infrastructure-Foundations, SRE, netops, and 2 others: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (JAllemandou)
[12:06:51] <btullis>	 Let me check, perhaps analytics-deployers is right for ops week duties, or perhaps analytics-admins.
[12:07:04] <joal>	 makes sense btullis 
[12:07:18] <joal>	 I'd go for analytics-deployers, but you know best :)
[12:08:24] <joal>	 Also btullis, sgupta should be given the right to send PRs to the puppet repo, in order for her to update the AQS druid snapshot
[12:08:41] <joal>	 (sorry to bother about this btullis  :S - Once done, it'll be over for this person)
[12:08:50] <btullis>	 I think it's analytics_admins because members of that group are then made members of analytics-deployers here: https://github.com/wikimedia/operations-puppet/blob/production/modules/admin/data/data.yaml#L897-L900
[12:09:01] <btullis>	 It's no bother. I'll sort it now.
[12:09:08] <joal>	 thanks a million :)
[12:11:06] <wikibugs>	 Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (brouberol) Next batch  ` topicmappr rebuild  --topics '^(eqiad.eventgate-main.test.event|eqiad.eventlogging_ContentTranslationAbuseFilter|eqiad.eventlogging_EditAttemptStep|eqiad.ios.edit_history_compare|eqiad.io...
[12:14:41] <btullis>	 joal: Would you mind adding a comment to here please, stating the requirement re ops week? https://phabricator.wikimedia.org/T335657
[12:15:18] <btullis>	 I've got rights to approve the group membership, but it would be good to have some record of the request.
[12:16:32] <joal>	 btullis: https://phabricator.wikimedia.org/T335657#9219848
[12:22:21] <btullis>	 Puppet patch ready. https://gerrit.wikimedia.org/r/c/operations/puppet/+/963022
[12:25:57] <wikibugs>	 Data-Platform-SRE, Cloud-VPS, cloud-services-team, User-aborrero: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (taavi) Thanks!  `lang=shell-session taavi@cloudcontrol1006 ~ $ os server list --long --all-projects --host cloudvirt-wdqs1001 +---------------------------------...
[12:28:15] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team (Sprint 3), Event-Platform, Patch-For-Review: mw-page-content-change-enrich should not retry on badrevids if no replica lag - https://phabricator.wikimedia.org/T347884 (CodeReviewBot) otto merged https://gitlab.wikimedia.org/repos/data-en...
[12:30:44] <joal>	 btullis: sorry to bother again - can you add sg912 to the puppet gerrit repo?
[12:32:10] <btullis>	 I'm doing it as we speak. I believe that it is the wmf LDAP group that I mentioned here: https://phabricator.wikimedia.org/T335657#9186606 - I thought someone else picked it up after I mentioned it, but it seems not. Just doing double-checks.
[12:34:36] <joal>	 <3
[12:37:29] <btullis>	 `wmf` LDAP group done (that should also allow Grafana login/edit, which was also requested) - puppet patch merged, Puppet running on deploy2002.codfw.wmnet now.
[12:38:54] <wikibugs>	 Data-Platform-SRE: Monitor kafka topics with a replication factor of 1 - https://phabricator.wikimedia.org/T346887 (brouberol) @Ottomata can you confirm that I can delete these topics? They all have RF=1:  ` brouberol@kafka-jumbo1010:~/topicmappr$ kafka topics --describe | grep 'ReplicationFactor:1' | grep -...
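The grep for 'ReplicationFactor:1' in the message above can also be expressed as a small parser over the describe output. This is a sketch under assumptions: it expects topic summary lines shaped like `Topic:name PartitionCount:N ReplicationFactor:N Configs:` with no space after the colons, as the grep pattern suggests; other Kafka CLI versions may format these lines differently. The function name and sample topics are hypothetical.

```python
def topics_with_rf(describe_output, rf=1):
    """Return topic names whose summary line reports the given replication factor.

    Only summary lines carry a ReplicationFactor field; per-partition
    lines are skipped because they lack it.
    """
    topics = []
    for line in describe_output.splitlines():
        # Turn "Key:value" tokens into a dict; tokens without ":" are ignored.
        fields = dict(part.split(":", 1) for part in line.split() if ":" in part)
        if fields.get("ReplicationFactor") == str(rf) and "Topic" in fields:
            topics.append(fields["Topic"])
    return topics

# Hypothetical sample in the assumed format.
sample = (
    "Topic:test_a\tPartitionCount:1\tReplicationFactor:1\tConfigs:\n"
    "\tTopic: test_a\tPartition: 0\tLeader: 1007\tReplicas: 1007\tIsr: 1007\n"
    "Topic:test_b\tPartitionCount:3\tReplicationFactor:3\tConfigs:\n"
)

print(topics_with_rf(sample))
```

Unlike a plain grep for 'ReplicationFactor:1', the exact string comparison here does not also match replication factors such as 10 or 12.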
[12:40:41] <btullis>	 https://www.irccloud.com/pastebin/UQjjPRqT/
[12:41:22] <btullis>	 joal: Please feel free to try the deploy again and log out/in of gerrit to pick up the new privileges.
[12:41:38] <joal>	 sure
[12:42:39] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Reassign partitions away from kafka-jumbo100[1-6] to kafka-jumbo10[07-15] brokers - https://phabricator.wikimedia.org/T346425 (brouberol)
[12:43:40] <wikibugs>	 Data-Platform-SRE: Install kafka-kit binaries on kafka brokers - https://phabricator.wikimedia.org/T346764 (brouberol) Open→Resolved
[12:43:43] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Reassign partitions away from kafka-jumbo100[1-6] to kafka-jumbo10[07-15] brokers - https://phabricator.wikimedia.org/T346425 (brouberol)
[12:43:48] <wikibugs>	 Data-Platform-SRE: Package kafka-kit binaries (topicmappr, metricsfetcher, ...) as a debian-package - https://phabricator.wikimedia.org/T346763 (brouberol) Open→Resolved
[12:43:52] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Reassign partitions away from kafka-jumbo100[1-6] to kafka-jumbo10[07-15] brokers - https://phabricator.wikimedia.org/T346425 (brouberol)
[12:48:00] <jinxer-wm>	 (SystemdUnitFailed) firing: wikidatardf-truthy-dumps.service Failed on snapshot1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:02:01] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Epic: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710 (Gehel)
[13:03:33] <wikibugs>	 Data-Platform-SRE, Wikidata, Wikidata-Query-Service: WDQS graph split: load data from dumps into new hosts - https://phabricator.wikimedia.org/T347504 (dcausse) @bking only one host has to be loaded with the full dataset. The loading process can be started as soon as possible but there are few constr...
[13:10:56] <wikibugs>	 Data-Engineering, EventStreams, Event-Platform: Make eventgate-analytics-external the default event service - https://phabricator.wikimedia.org/T342610 (phuedx) >>! In T342610#9216725, @Ottomata wrote: > Is this just so it doesn't have to be set manually for analytics devs creating streams?  Yes.   >...
[13:11:49] <gehel>	 btullis (and others): I wrote a description for our Superset goal (https://docs.google.com/document/d/1dfz-aeKFRMlYJEtzhNjoIruxMNsnlMbAE4NNe4u8ZIw/edit). If you could review and comment...
[13:12:19] <btullis>	 gehel: Great, thanks. Will do.
[13:18:11] <wikibugs>	 Data-Platform-SRE, Discovery-Search: Investigate recent CirrusSearch p95 latency - https://phabricator.wikimedia.org/T347988 (bking)
[13:18:46] <wikibugs>	 Data-Platform-SRE, Discovery-Search: Investigate recent CirrusSearch p95 latency - https://phabricator.wikimedia.org/T347988 (bking)
[13:19:43] <wikibugs>	 Data-Platform-SRE, Wikidata, Wikidata-Query-Service: Document process for getting JNL files/consider automation - https://phabricator.wikimedia.org/T347605 (dr0ptp4kt) For me the first 300 GB of the file went really, really fast. But `axel` was dropping connections, similar to when I had downloaded t...
[13:23:28] <wikibugs>	 Data-Platform-SRE, Cloud-VPS, cloud-services-team, User-aborrero: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by taavi@cumin1001 for hosts: `cloudvirt-wdqs1002.eqiad.wmnet` - cloudvirt-wdqs1002.eqiad.wmnet (*...
[13:32:44] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:33:13] <wikibugs>	 Data-Platform-SRE: Monitor kafka topics with a replication factor of 1 - https://phabricator.wikimedia.org/T346887 (brouberol) And more broadly, are you using any of these topics? ` brouberol@kafka-jumbo1010:~/topicmappr$ kafka topics --describe | grep -i otto | awk '{ print $1 }' | sed 's/Topic://' | grep -...
[13:33:22] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:34:12] <wikibugs>	 10Data-Engineering, 10EventStreams, 10Event-Platform: Make eventgate-analytics-external the default event service - https://phabricator.wikimedia.org/T342610 (10Ottomata) > Does anything fail if a developer tries to create a stream without destination_event_service set?  Nothing explicitly, but there won't b...
[13:34:19] <wikibugs>	 10Data-Platform-SRE, 10Cloud-VPS, 10cloud-services-team, 10User-aborrero: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by taavi@cumin1001 for hosts: `cloudvirt-wdqs1003.eqiad.wmnet` - cloudvirt-wdqs1003.eqiad.wmnet (*...
[13:36:52] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10SRE-Access-Requests, 10Event-Platform: Add Antoine_Quhen to the deployment group  - https://phabricator.wikimedia.org/T347296 (10Ottomata)
[13:37:17] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10SRE-Access-Requests, 10Event-Platform: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (10Ottomata) Updated description and tagged #sre-access-requests
[13:38:41] <wikibugs>	 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) Next reassignment will only span `eqiad.mediawiki.api-request` as it's a large topic.  ` brouberol@kafka-jumbo1010:~/topicmappr$ topicmappr rebuild  --topics 'eqiad.mediawiki.api-request' --brokers 100...
[13:39:52] <wikibugs>	 10Data-Engineering, 10Event-Platform, 10MW-1.41-notes (1.41.0-wmf.29; 2023-10-03), 10Platform Team Initiatives (New Hook System): Update EventBus to use the new HookContainer/HookRunner system - https://phabricator.wikimedia.org/T346539 (10Ottomata) @Umherirrender is CentralNotice part of this task?  If no...
[13:43:40] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye
[13:46:12] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:46:42] <wikibugs>	 10Data-Engineering, 10EventStreams, 10Event-Platform: Make eventgate-analytics-external the default event service - https://phabricator.wikimedia.org/T342610 (10phuedx) 05Open→03Resolved a:03phuedx Being **bold**. In this case it's better to be explicit.  Maybe we could write a test to check that the p...
[13:47:44] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:48:06] <wikibugs>	 10Data-Platform-SRE, 10Dumps-Generation, 10cloud-services-team: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (10BTullis) @jbond - Could I ask you to cast your eye over this (T346165#9219409) please, if you have a little time.  We...
[13:48:42] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Epic: [EPIC] Deployment of the Search Update Pipeline on Flink / k8s - https://phabricator.wikimedia.org/T340548 (10bking)
[13:49:17] <brouberol>	 btullis is there a way to proactively mute an alert in alertmanager? My understanding is that if the alert does not yet fire, it's not visible in alertmanager.
[13:49:49] <wikibugs>	 10Data-Platform-SRE, 10Cloud-VPS, 10cloud-services-team, 10User-aborrero: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by taavi@cumin1001 for hosts: `cloudvirt-wdqs1001.eqiad.wmnet` - cloudvirt-wdqs1001.eqiad.wmnet (*...
[13:51:17] <btullis>	 brouberol: I think you can do it with label matching from the top-right corner https://usercontent.irccloud-cdn.com/file/XLXDMEyC/image.png
[13:51:42] <btullis>	 Or from the CLI: https://wikitech.wikimedia.org/wiki/Alertmanager#Add_a_silence_via_CLI
[13:51:58] <btullis>	 ...but I confess I've not done either much.
[13:52:57] <wikibugs>	 (03PS14) 10Milimetric: Create a job to dump XML/SQL MW history files to HDFS [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) (owner: 10Aqu)
[13:53:23] <brouberol>	 Thanks! Using the first method I was able to browse previous expired silences and re-create one
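(Editor's sketch, not part of the channel log.) The exchange above is about creating a silence by label matching before the alert fires. Alertmanager silences are matcher-based, so one can be created even when no matching alert is currently visible. The Python below only builds the JSON body that upstream Alertmanager's v2 API accepts at POST /api/v2/silences; it does not contact any server. The helper name build_silence, the two-hour window, the timestamp, and the comment text are illustrative assumptions, and whether the Wikimedia deployment exposes this endpoint directly is not stated in the log. The alertname value is taken from the alerts shown in this channel.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(matchers, hours, created_by, comment, now=None):
    """Build a silence body for Alertmanager's v2 API (hypothetical helper).

    matchers: dict of label name -> exact value to match.
    hours: how long the silence should last, counted from `now`.
    """
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        # Exact-match, non-regex matchers, sorted for a stable output.
        "matchers": [
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in sorted(matchers.items())
        ],
        # Timezone-aware isoformat() yields RFC 3339 timestamps.
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


if __name__ == "__main__":
    # Assumed values for illustration only.
    body = build_silence(
        {"alertname": "SystemdUnitFailed"},
        hours=2,
        created_by="brouberol",
        comment="planned maintenance (example)",
        now=datetime(2023, 10, 4, 13, 49, tzinfo=timezone.utc),
    )
    print(json.dumps(body, indent=2))
```

The resulting dictionary could be sent with any HTTP client, or the same matchers could be typed into the web UI's silence form mentioned by btullis.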
[13:55:07] <wikibugs>	 (03CR) 10Milimetric: "ooh, 14's my lucky number.  This is being merged with a couple of caveats:" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) (owner: 10Aqu)
[13:55:08] <wikibugs>	 10Data-Engineering, 10Event-Platform, 10MW-1.41-notes (1.41.0-wmf.29; 2023-10-03), 10Platform Team Initiatives (New Hook System): Update EventBus to use the new HookContainer/HookRunner system - https://phabricator.wikimedia.org/T346539 (10Umherirrender) a:03Umherirrender >>! In T346539#9220207, @Ottomat...
[13:55:21] <wikibugs>	 (03CR) 10Milimetric: [C: 03+2] Create a job to dump XML/SQL MW history files to HDFS [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) (owner: 10Aqu)
[13:55:26] <wikibugs>	 10Data-Platform-SRE, 10Dumps-Generation, 10cloud-services-team, 10Patch-For-Review: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (10jbond) @BTullis from a quick look the resources that is changing is the ferm::service resource...
[13:58:07] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: mw-page-content-change-enrich should not retry on badrevids if no replica lag - https://phabricator.wikimedia.org/T347884 (10Ottomata) Wow, before deploying this, we were backfilling with 20 replicas and doing ~100 m...
[13:59:50] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye
[14:05:32] <wikibugs>	 (03Merged) 10jenkins-bot: Create a job to dump XML/SQL MW history files to HDFS [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) (owner: 10Aqu)
[14:15:55] <wikibugs>	 10Data-Platform-SRE: Monitor kafka topics with a replication factor of 1 - https://phabricator.wikimedia.org/T346887 (10Ottomata) Please delete all of those!   Thanks.
[14:16:27] <wikibugs>	 10Data-Platform-SRE: Monitor kafka topics with a replication factor of 1 - https://phabricator.wikimedia.org/T346887 (10Ottomata) You can also probably delete ANY topic that has ksql in it.  We've never used KSQL in prod.
[14:31:08] <wikibugs>	 10Data-Engineering, 10EventStreams, 10Event-Platform: Make eventgate-analytics-external the default event service - https://phabricator.wikimedia.org/T342610 (10Ottomata) > Maybe we could write a test to check that the property is present on all streams?  It doesn't need to be present on all streams, since n...
[14:35:06] <jinxer-wm>	 (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[14:37:01] <wikibugs>	 10Data-Platform-SRE, 10Dumps-Generation, 10cloud-services-team, 10Patch-For-Review: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (10BTullis) >>! In T346165#9220435, @jbond wrote: > @BTullis from a quick look the resources that...
[14:38:32] <wikibugs>	 10Data-Engineering, 10Event-Platform, 10MW-1.41-notes (1.41.0-wmf.29; 2023-10-03), 10Patch-For-Review, 10Platform Team Initiatives (New Hook System): Update EventBus to use the new HookContainer/HookRunner system - https://phabricator.wikimedia.org/T346539 (10Ottomata) Oh  right! I had forgotten those wh...
[14:42:46] <wikibugs>	 10Data-Engineering, 10EventStreams, 10Event-Platform: Make eventgate-analytics-external the default event service - https://phabricator.wikimedia.org/T342610 (10phuedx) TIL
[14:46:47] <wikibugs>	 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform, 10Patch-For-Review: Enable canary events for all streams - https://phabricator.wikimedia.org/T266798 (10Ottomata) ^ oops, wrong task.
[15:12:59] <wikibugs>	 10Data-Platform-SRE, 10Dumps-Generation, 10cloud-services-team, 10Patch-For-Review: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (10BBlack) We could add some normalization function at the ferm or puppet-dns-lookup layer perhaps...
[15:20:03] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed wi...
[15:33:45] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Event-Platform: Partition reassignment on kafka-jumbo negatively impacting mw-page-content-change-enrich - https://phabricator.wikimedia.org/T347676 (10Ottomata)
[15:37:47] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed wi...
[15:43:17] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: [TEMPLATE] Onboard request for APPLICATION NAME to Event Platform - https://phabricator.wikimedia.org/T346207 (10Ottomata) Can this be declined?  Or is this the actual Template?
[15:46:30] <wikibugs>	 10Data-Engineering, 10EventStreams, 10Event-Platform: Event streams don't respect milliseconds UTC unix epoch timestamp in since parameter - https://phabricator.wikimedia.org/T345606 (10Ottomata) Thanks for the report!  Getting it into triage...
[15:49:52] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye
[15:49:56] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye
[15:57:43] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Five deleted Wikidata items pertaining to Wikimedia category pages still present in the Query Service - https://phabricator.wikimedia.org/T342593 (10Ottomata) > I agree with @Milimetric here...
[16:03:11] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: Rename and release stream as mediawiki.page_content_change.v1 - https://phabricator.wikimedia.org/T341783 (10Ottomata) 05Open→03Resolved a:03Ottomata I believe this is already done. Resolving.
[16:05:00] <wikibugs>	 10Data-Engineering, 10Event-Platform: mediawiki-event-enrichment deployment process should include producing an event in staging and verifying success - https://phabricator.wikimedia.org/T341138 (10Ottomata) This is a great idea.  EventGate does this for its readiness probe.  We can do this in prod too, but pr...
[16:05:49] <wikibugs>	 10Data-Engineering, 10Event-Platform: mediawiki-event-enrichment deployment process should include producing an event in staging and verifying success - https://phabricator.wikimedia.org/T341138 (10Ottomata) Oh wait this filed by past me.  Great idea past me: doh.
[16:07:04] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: eventutilities-python EventProcessFunction throws NPE if user func returns None - https://phabricator.wikimedia.org/T335706 (10Ottomata) 05Open→03Resolved
[16:12:09] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: flink-app: swift bucket and zookeeper paths should be templated. - https://phabricator.wikimedia.org/T336901 (10Ottomata) Related: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/959059
[16:15:25] <wikibugs>	 10Data-Platform-SRE, 10Dumps-Generation, 10cloud-services-team, 10Patch-For-Review: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (10jbond) @BTullis if its the zero padding we could be hitting a bug in [[ https://github.com/ruby...
[16:17:22] <wikibugs>	 10Data-Engineering, 10Anti-Harassment, 10Data Engineering and Event Platform Team, 10Privacy Engineering, and 4 others: Exposing revIDs (nothing more) of deleted/suppressed edits for research to respect their removal - https://phabricator.wikimedia.org/T200559 (10Ottomata) 05Open→03Resolved a:03Ottoma...
[16:19:14] <wikibugs>	 10Data-Engineering, 10Machine-Learning-Team, 10Wikimedia Enterprise, 10Data Engineering and Event Platform Team (Sprint 3), and 2 others: Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (10Ottomata)
[16:19:16] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Machine-Learning-Team, 10Research, 10Event-Platform: Proposal: Create a stream end point for Revision Risk Model - https://phabricator.wikimedia.org/T326179 (10Ottomata)
[16:20:08] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team, 10MediaWiki-extensions-WikimediaEvents, 10Product-Analytics, and 3 others: Decommission the EditorActivation instrument - https://phabricator.wikimedia.org/T330766 (10Ottomata) How we doin here?  :)
[16:21:18] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team, 10MediaWiki-Vagrant, 10Event-Platform: EventBus should not blackhole undeclared streams - https://phabricator.wikimedia.org/T329480 (10Ottomata) 05Open→03Invalid
[16:22:39] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: Automated event stream throughput alerting for important state change streams - https://phabricator.wikimedia.org/T329070 (10Ottomata) Related: {T345195}
[16:23:52] <jinxer-wm>	 (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks
[16:29:56] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: [EPIC] Streaming and event driven Python services - https://phabricator.wikimedia.org/T324689 (10Ottomata)
[16:30:02] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: [NEEDS GROOMING] Integrate Flink Table API in eventutils-python - https://phabricator.wikimedia.org/T324953 (10Ottomata) 05Open→03Declined I'm going to be bold and decline this one. If/when we decide to really really suppor...
[16:30:42] <wikibugs>	 10Analytics-Radar, 10Data-Engineering, 10Data Engineering and Event Platform Team, 10SRE, and 2 others: Configuration Management for Kafka settings - https://phabricator.wikimedia.org/T276088 (10Ottomata)
[16:31:54] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: Refactor EventBus extension Hooks to use new hook system - https://phabricator.wikimedia.org/T320655 (10Ottomata)
[16:32:03] <wikibugs>	 10Data-Engineering, 10Event-Platform, 10MW-1.41-notes (1.41.0-wmf.29; 2023-10-03), 10Patch-For-Review, 10Platform Team Initiatives (New Hook System): Update EventBus to use the new HookContainer/HookRunner system - https://phabricator.wikimedia.org/T346539 (10Ottomata)
[16:32:28] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team: Drop GuidedTour* tables - https://phabricator.wikimedia.org/T317460 (10Ottomata)
[16:34:34] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: Add schema diffing support to jsonschema-tools and run diff in CI - https://phabricator.wikimedia.org/T321850 (10Ottomata) An alternative to do this would be to also materialize another file with a static name that has the full...
[17:02:44] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:04:32] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w...
[17:05:06] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:10:02] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed w...
[17:15:04] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:17:44] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:18:29] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (10bking)
[17:18:36] <wikibugs>	 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: Migrate the WDQS streaming updater from FlinkKafkaConsumer/Producer to KafkaSource/Sink - https://phabricator.wikimedia.org/T326914 (10bking)
[17:19:22] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Epic: [EPIC] Deployment of the Search Update Pipeline on Flink / k8s - https://phabricator.wikimedia.org/T340548 (10bking)
[17:19:33] <wikibugs>	 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: Migrate the WDQS streaming updater from FlinkKafkaConsumer/Producer to KafkaSource/Sink - https://phabricator.wikimedia.org/T326914 (10bking)
[17:29:02] <wikibugs>	 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Document process for getting JNL files/consider automation - https://phabricator.wikimedia.org/T347605 (10dr0ptp4kt) Here's the `sha1sum` for the latest file I had downloaded:  ` /mnt/x$ time sha1sum wikidata.jnl.zst 62327feb2c6ad5b352b5abfe9f0a4d3cc...
[17:33:30] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye
[17:33:34] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye
[17:34:01] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w...
[17:34:39] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye
[17:35:04] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w...
[17:49:29] <wikibugs>	 10Data-Engineering, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): 2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF org.openrdf.rio.RDFParseException: Expected '.', found 'g'` - https://phabricator.wikimedia.org/T347647 (10dr0ptp4kt) I did manage to run a `sha1sum`...
[17:55:01] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team, 10MediaWiki-extensions-WikimediaEvents, 10Product-Analytics, and 3 others: Decommission the EditorActivation instrument - https://phabricator.wikimedia.org/T330766 (10phuedx)
[17:55:18] <wikibugs>	 10Analytics-Kanban, 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Fundraising-Backlog, and 3 others: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned - https://phabricator.wikimedia.org/T282131 (10phuedx)
[17:55:36] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team, 10MediaWiki-extensions-WikimediaEvents, 10Product-Analytics, and 3 others: Decommission the EditorActivation instrument - https://phabricator.wikimedia.org/T330766 (10phuedx) 05Open→03Resolved a:03phuedx Being **bold**.
[18:28:48] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (10bking) Moving to blocked, as we cannot start the test service until T326914 is resolved.
[18:35:06] <jinxer-wm>	 (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[18:40:23] <wikibugs>	 10Data-Engineering, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): 2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF org.openrdf.rio.RDFParseException: Expected '.', found 'g'` - https://phabricator.wikimedia.org/T347647 (10dr0ptp4kt) 05Open→03Resolved I'm going...
[18:48:23] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed w...
[19:11:39] <wikibugs>	 10Analytics, 10Data-Engineering-Radar, 10Data Engineering and Event Platform Team, 10Product-Analytics, 10Event-Platform: [MEP] Determine how stream configuration is authored and deployed - https://phabricator.wikimedia.org/T269774 (10Ottomata) 05Open→03Declined being bold and declining, cc @phuedx
[19:13:26] <wikibugs>	 10Analytics-Radar, 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: Duplicated revision_create events - https://phabricator.wikimedia.org/T262203 (10Ottomata) @Milimetric could these also be null edits?
[19:14:08] <wikibugs>	 10Analytics-Kanban, 10Data-Engineering, 10Data Engineering and Event Platform Team, 10MediaWiki-extensions-EventLogging, and 3 others: Modern Event Platform - https://phabricator.wikimedia.org/T185233 (10Ottomata)
[19:14:12] <wikibugs>	 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Metrics Platform Backlog, 10Event-Platform: Document in-schema who sets which fields - https://phabricator.wikimedia.org/T253392 (10Ottomata) 05Open→03Declined Being bold and declining, feel free to reopen.
[19:15:31] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye
[19:15:35] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye
[19:15:55] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed w...
[19:15:57] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w...
[19:16:37] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye
[19:16:40] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye
[19:24:11] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: mw-page-content-change-enrich should not retry on badrevids if no replica lag - https://phabricator.wikimedia.org/T347884 (10Ottomata) a:03Ottomata
[19:38:07] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w...
[19:41:54] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye
[20:24:07] <jinxer-wm>	 (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks
[20:49:45] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed w...
[20:56:39] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w...
[21:14:31] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search: Investigate recent CirrusSearch p95 latency - https://phabricator.wikimedia.org/T347988 (10bking)
[21:18:00] <jinxer-wm>	 (SystemdUnitFailed) firing: wikidatardf-truthy-dumps.service Failed on snapshot1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:00:34] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search: Investigate recent CirrusSearch p95 latency - https://phabricator.wikimedia.org/T347988 (10bking)
[22:03:26] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search: Investigate recent CirrusSearch p95 latency - https://phabricator.wikimedia.org/T347988 (10bking) Looked at this today with @Gehel and @RKemper at SRE pairing.  We focused on the specific time frame of 2023-10-03 02:00:00 to 2023-10-03 09:00:00 . High latency affected b...
[22:35:06] <jinxer-wm>	 (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[23:23:42] <wikibugs>	 10Quarry, 10superset.wmcloud.org, 10cloud-services-team (FY2023/2024-Q1): Replace Quarry with an installation of Superset - https://phabricator.wikimedia.org/T169452 (10Audiodude) So is it correct that we're looking for a new maintainer, but only in the capacity of migrating all usage of Quarry to Superset?...