[00:00:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:07:59] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Movement-Insights, 10Product-Analytics: Reconstruct Hive & Hadoop permissions for shared database - https://phabricator.wikimedia.org/T288983 (10Mayakp.wiki)
[00:20:04] <wikibugs>	 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Wmfdata-Python triggers a Pandas warning during mariadb.run and hive.run - https://phabricator.wikimedia.org/T324135 (10nshahquinn-wmf)
[00:42:10] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Data Pipelines: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (10nshahquinn-wmf) @BTullis I just released Wmfdata 2.0.1 with a [few small fixes](https://github.com/w...
[00:56:39] <wikibugs>	 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Remove Wmfdata's custom update-notification code - https://phabricator.wikimedia.org/T346706 (10nshahquinn-wmf)
[01:01:19] <wikibugs>	 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Remove Wmfdata code related to the conda-analytics migration - https://phabricator.wikimedia.org/T346707 (10nshahquinn-wmf)
[01:15:19] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:20:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:38:42] <jinxer-wm>	 (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:40:03] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:18:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:35:54] <wikibugs>	 10Data-Engineering, 10Data-Services, 10cloud-services-team: Surface Temporary user information to Cloud Wiki Replicas - https://phabricator.wikimedia.org/T346679 (10Marostegui)
[07:33:36] <brouberol>	 I just realized that `helmfile` was written by an ex-colleague of mine. This is why the name rang a bell
[07:57:02] <wikibugs>	 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1146.eqiad.wmnet with OS bullseye
[07:59:24] <brouberol>	 !log redeploying eventgate-analytics in staging T336041
[07:59:27] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:59:27] <stashbot>	 T336041: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041
[08:02:23] <brouberol>	 !log redeploying eventgate-analytics-external in staging T336041
[08:02:26] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:05:07] <brouberol>	 !log redeploying eventstream-internal in staging T336041
[08:05:16] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:05:16] <stashbot>	 T336041: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041
[08:10:47] <brouberol>	 stevemunene: I have successfully deployed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/958496 in staging. Is there an extra step I have to take to deploy to production (only deploy during a given time window, make an announcement of some sort, etc)?
[08:12:33] <btullis>	 brouberol: I can answer this one :-) No, nothing usually. It's usually straight on to codfw and then eqiad, as long as no problems have been observed in staging.
[08:15:34] <brouberol>	 alrighty then! (I asked Steve because you've been my go-to person for any and all questions for the past 2 weeks and you might be getting tired of it)
[08:17:01] <btullis>	 Not a problem. :-)
[08:17:29] <brouberol>	 !log redeploying eventstream-analytics in eqiad T336041
[08:17:33] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:17:33] <stashbot>	 T336041: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041
[08:18:57] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:20:04] <wikibugs>	 10Data-Platform-SRE: Bring stat1010 into service with GPU from stat1005 - https://phabricator.wikimedia.org/T336040 (10Gehel) Confirmed that the GPU is currently still in stat1005
[08:26:55] <btullis>	 team: - I think I'd like to proceed with this kafka-jumbo max-message size settings change today, if that's ok. https://gerrit.wikimedia.org/r/c/operations/puppet/+/952160/ - I have a +1 but I'd be keen for a double-check.
[08:27:33] <elukey>	 +1!
[08:28:17] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 (10brouberol)
[08:36:20] <wikibugs>	 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1146.eqiad.wmnet with OS bullseye completed: - an-worker1146 (**WARN**)   - Downtimed on Icinga/Alertm...
[08:53:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:58:21] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Product-Analytics: Conda analytics environments breakage - conflicting dependencies between r-base and other - https://phabricator.wikimedia.org/T343823 (10Gehel) We need a higher level strategy on how we are exposing our compute / data platform. @lbowmaker should b...
[09:00:03] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform: [SPIKE] Can we identify indicators to inform an SLO for event emission and intake? - https://phabricator.wikimedia.org/T345195 (10gmodena)
[09:03:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:08:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:13:54] <wikibugs>	 10Data-Platform-SRE: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10Stevemunene) a:03Stevemunene
[09:18:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:23:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:26:48] <wikibugs>	 10Data-Platform-SRE, 10Dumps-Generation, 10cloud-services-team: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (10dcaro)
[09:27:56] <btullis>	 !log deploying change to kafka-jumbo settings for T344688
[09:27:59] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:27:59] <stashbot>	 T344688: Increase Max Message Size in Kafka Jumbo  - https://phabricator.wikimedia.org/T344688
[09:27:59] <brouberol>	 btullis: here's the tiny Gobblin config change I was talking about, if you have 2 min
[09:28:05] <brouberol>	 https://gerrit.wikimedia.org/r/c/analytics/refinery/+/958511/
[09:31:48] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good to me." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/958511 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol)
[09:32:17] <wikibugs>	 (03CR) 10Brouberol: [C: 03+1] [Gobblin] Add kafka-jumbo1010 to config [analytics/refinery] - 10https://gerrit.wikimedia.org/r/958511 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol)
[09:33:04] <btullis>	 brouberol: Looks good, I've +1'd it. I would make a reference to it in https://etherpad.wikimedia.org/p/analytics-weekly-train and then it can go out the next time someone deploys refinery, which will likely be this afternoon.
[09:33:17] <brouberol>	 already done
[09:33:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:34:28] <brouberol>	 however, it seems a +1 + another self-applied +1 isn't enough to merge
[09:35:18] <brouberol>	 do these CRs get merged by the data-eng team when the train leaves?
[09:35:19] <btullis>	 Do you not have +2 rights yet? I can add you if not. 
[09:35:29] <brouberol>	 not on that repo, it seems
[09:36:31] <btullis>	 Hang on a sec... I can add you. There was some discussion about this kind of thing on Slack yesterday. https://wikimedia.slack.com/archives/CSV483812/p1695041181809139
[09:37:31] <btullis>	 Can you try again now please?
[09:38:25] <btullis>	 > do these CRs get merged by the data-eng team when the train leaves?
[09:38:25] <btullis>	 I would say, sometimes. Might depend on the comment in the etherpad and/or whether any corresponding changes are required to other repos.
[09:38:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:38:53] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] [Gobblin] Add kafka-jumbo1010 to config [analytics/refinery] - 10https://gerrit.wikimedia.org/r/958511 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol)
[09:39:01] <brouberol>	 it worked, thanks!
[09:39:05] <wikibugs>	 (03CR) 10Brouberol: [V: 03+2 C: 03+2] [Gobblin] Add kafka-jumbo1010 to config [analytics/refinery] - 10https://gerrit.wikimedia.org/r/958511 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol)
[09:40:25] <btullis>	 !log commencing rolling restart of all brokers in kafka-jumbo
[09:40:26] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:42:57] <wikibugs>	 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1147.eqiad.wmnet with OS bullseye
[09:48:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:53:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:58:30] <wikibugs>	 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1148.eqiad.wmnet with OS bullseye
[10:03:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:13:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:18:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:25:05] <wikibugs>	 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1147.eqiad.wmnet with OS bullseye completed: - an-worker1147 (**PASS**)   - Downtimed on Icinga/Alertm...
[10:38:33] <wikibugs>	 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1148.eqiad.wmnet with OS bullseye completed: - an-worker1148 (**PASS**)   - Downtimed on Icinga/Alertm...
[10:38:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:41:01] <wikibugs>	 (03PS7) 10Btullis: Update to Superset version 2.2.1 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/957938 (https://phabricator.wikimedia.org/T335356)
[10:44:46] <wikibugs>	 10Data-Platform-SRE, 10Dumps-Generation, 10cloud-services-team: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (10Milimetric) Ok, to resolve I'm going to erase this dvd.html file from all the dumpsdata hosts as [[ https://wikitech....
[10:48:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:56:43] <wikibugs>	 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10Stevemunene) We have successfully completed the hadoop worker upgrades to Bullseye.  ` sudo cumin --no-progress a:hadoop-worker 'cat /etc/debian_version' 86 hosts will be targeted: an-worker[1078-1095,1097-1156...
[10:57:50] <wikibugs>	 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10BTullis) >>! In T332570#9177923, @Stevemunene wrote: > We have successfully completed the hadoop worker upgrades to Bullseye.  Excellent!
[11:08:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:18:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:24:45] <wikibugs>	 (03PS1) 10Jennifer Ebe: Update changelog for v0.2.22 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/958909
[11:25:39] <wikibugs>	 (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/958909 (owner: 10Jennifer Ebe)
[11:26:51] <wmf-insecte>	 Starting build #127 for job analytics-refinery-maven-release-docker
[11:38:17] <wikibugs>	 (03PS8) 10Btullis: Update to Superset version 3.0.0 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/957938 (https://phabricator.wikimedia.org/T335356)
[11:39:36] <wmf-insecte>	 Project analytics-refinery-maven-release-docker build #127: 09SUCCESS in 12 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/127/
[11:40:50] <wmf-insecte>	 Starting build #86 for job analytics-refinery-update-jars-docker
[11:41:12] <wikibugs>	 (03PS1) 10Maven-release-user: Add refinery-source jars for v0.2.22 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/957825
[11:41:13] <wmf-insecte>	 Project analytics-refinery-update-jars-docker build #86: 09SUCCESS in 22 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/86/
[11:44:25] <wikibugs>	 (03CR) 10Jennifer Ebe: [V: 03+2 C: 03+2] "Merging for deployment" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/957825 (owner: 10Maven-release-user)
[12:02:23] <jennifer_ebe>	 !log deploying refinery from deployment.eqiad.wmnet
[12:02:25] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:10:58] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE: Define priorities for HDFS data to be backed up - https://phabricator.wikimedia.org/T283261 (10JAllemandou) a:05JAllemandou→03None
[12:26:52] <brouberol>	 btullis: is the kafka-jumbo rr still ongoing?
[12:27:08] <joal>	 btullis - hello
[12:27:59] <joal>	 btullis: we're having an issue with our deployment - I think it's related to this patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/950194
[12:28:06] <joal>	 btullis: maybe
[12:28:18] <btullis>	 brouberol: Yes 'fraid so. 8 out of 10 have been restarted.
[12:28:28] <joal>	 the problem: we (either jennifer_ebe or myself) can't sudo as hdfs on an-launcher1002
[12:28:38] <joal>	 btullis: --^
[12:28:52] <brouberol>	 no worries! I'm catching up on other tasks in the meantime
[12:29:13] <btullis>	 joal: OK, shall we jump on a meeting? 
[12:29:21] <joal>	 sure - batcave
[12:42:51] <jinxer-wm>	 (HdfsRpcQueueLength) firing: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength
[12:44:06] <elukey>	 haven't seen --^ in a while :D
[12:44:12] <brouberol>	 and general question: I see that we're waiting for >=900s when we restart kafka services until we allow ourselves to move on to the next one. I was wondering if it would make sense to provide a command `kafka-check-broker-insync --broker-id xxxx` that would check whether all assigned partitions are back into the ISR. This way, if the broker catches
[12:44:12] <brouberol>	 up in less than 900s, we'd be able to have faster rolling restarts, and if the broker catches up in *more* than 900s, we'd avoid putting the cluster at risk by rebooting another broker.
[12:44:42] <elukey>	 brouberol: yes definitely we never really got to that point
[12:45:07] * brouberol makes a note to create a ticket
[12:45:52] <brouberol>	 note: this is how we were doing w/ kafka in kube: the readiness probe of the pod was checking for the in-sync status of all partitions assigned to the broker + some additional wait time for good measure
[12:46:09] <brouberol>	 in $PREV_JOB I mean
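A minimal sketch of the gate brouberol describes above, assuming partition metadata has already been fetched by some other means. `PartitionState`, `out_of_sync_partitions`, `wait_until_in_sync` and the `fetch_state` callback are hypothetical names for illustration only; they are not part of any existing cookbook or of the proposed `kafka-check-broker-insync` command. The 900s value is the current fixed wait mentioned in the discussion, used here as an upper bound rather than a fixed sleep.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class PartitionState:
    """Hypothetical snapshot of one partition's replica assignment."""
    topic: str
    partition: int
    replicas: Tuple[int, ...]  # broker ids assigned to this partition
    isr: Tuple[int, ...]       # broker ids currently in the in-sync replica set


def out_of_sync_partitions(
    broker_id: int, partitions: Iterable[PartitionState]
) -> List[PartitionState]:
    """Return partitions assigned to broker_id for which it is not yet in the ISR."""
    return [
        p for p in partitions
        if broker_id in p.replicas and broker_id not in p.isr
    ]


def wait_until_in_sync(
    broker_id: int,
    fetch_state: Callable[[], Iterable[PartitionState]],
    timeout_s: float = 900.0,
    poll_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until broker_id is back in the ISR of all its partitions.

    Returns True as soon as the broker has caught up, or False if the
    timeout elapses first, in which case the caller should not move on
    to the next broker.
    """
    deadline = clock() + timeout_s
    while True:
        if not out_of_sync_partitions(broker_id, fetch_state()):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

With a gate like this, a rolling restart could continue as soon as `wait_until_in_sync` returns True, and stop instead of restarting another broker when it returns False. How `fetch_state` obtains the replica and ISR lists is left open here.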
[12:47:51] <jinxer-wm>	 (HdfsRpcQueueLength) resolved: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength
[12:49:16] <wikibugs>	 (03PS1) 10Btullis: Update refinery-deploy-to-hdfs to use sudo [analytics/refinery] - 10https://gerrit.wikimedia.org/r/958927 (https://phabricator.wikimedia.org/T334493)
[12:50:22] <wikibugs>	 (03PS2) 10Btullis: Update refinery-deploy-to-hdfs to use sudo [analytics/refinery] - 10https://gerrit.wikimedia.org/r/958927 (https://phabricator.wikimedia.org/T334493)
[12:51:05] <wikibugs>	 (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging to fix deploy :)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/958927 (https://phabricator.wikimedia.org/T334493) (owner: 10Btullis)
[12:56:53] <wikibugs>	 10Data-Platform-SRE: Use the replication status of assigned partitions as a gate between kafka broker rolling-restarts - https://phabricator.wikimedia.org/T346741 (10brouberol)
[13:05:43] <btullis>	 brouberol: The cookbook has finished. kafka-jumbo restarted.
[13:06:25] <brouberol>	 :+1 thanks! if that's alright with you, I'll proceed with the provisioning of kafka-jumbo1011->1015 after a rebase
[13:07:26] <jennifer_ebe>	 !log redeploying refinery from deployment.eqiad.wmnet using scap
[13:07:27] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:11:01] <brouberol>	 in that spirit, can I ask for an approval of https://gerrit.wikimedia.org/r/c/operations/puppet/+/957919/10 and all subsequent PRs? Thanks!
[13:21:55] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform: Increase Max Message Size in Kafka Jumbo - https://phabricator.wikimedia.org/T344688 (10BTullis) This is deployed and all of the brokers in the kafka-jumbo cluster have been restarted. I'll lea...
[13:23:07] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: Use the replication status of assigned partitions as a gate between kafka broker rolling-restarts - https://phabricator.wikimedia.org/T346741 (10brouberol)
[13:32:46] <jennifer_ebe>	 !log deployment successful
[13:32:47] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:42:50] <wikibugs>	 10Data-Platform-SRE, 10decommission-hardware, 10Patch-For-Review: decommission an-test-client1001.eqiad.wmnet - https://phabricator.wikimedia.org/T343520 (10Stevemunene)
[13:43:37] <jennifer_ebe>	 !log deploying airflow analytics dag
[13:43:39] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:46:10] <joal>	 Hi mforns - Would you mind checking https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/488 when you have a minute? I have changed the second job as well :)
[13:48:11] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:48:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:52:50] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:53:38] <wikibugs>	 10Data-Platform-SRE, 10decommission-hardware, 10Patch-For-Review: decommission an-test-client1001.eqiad.wmnet - https://phabricator.wikimedia.org/T343520 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by stevemunene@cumin1001 for hosts: `an-test-client1001.eqiad.wmnet` - an-test-client1001....
[13:53:46] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 (10brouberol)
[13:54:31] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 (10brouberol)
[13:57:44] <btullis>	 !log pushing out https://gerrit.wikimedia.org/r/c/operations/puppet/+/955893 for new refinery job jar files
[13:57:45] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:57:45] <jinxer-wm>	 (SystemdUnitFailed) firing: (6) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:59:44] <icinga-wm>	 PROBLEM - Check systemd state on kafka-jumbo1015 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:00:19] <wikibugs>	 10Data-Engineering, 10Structured-Data-Backlog: NEW BUG REPORT fiwiki’s section-level image suggestions aren’t generated in production - https://phabricator.wikimedia.org/T343844 (10mfossati)
[14:00:47] <wikibugs>	 (03PS1) 10Brouberol: Add kafka-jumbo10[11-15].eqiad.wmnet to the gobblin broker list [analytics/refinery] - 10https://gerrit.wikimedia.org/r/958939 (https://phabricator.wikimedia.org/T336041)
[14:02:51] <wikibugs>	 10Data-Platform-SRE, 10decommission-hardware, 10Patch-For-Review: decommission an-test-client1001.eqiad.wmnet - https://phabricator.wikimedia.org/T343520 (10Stevemunene)
[14:03:36] <mforns>	 joal will look!
[14:07:24] <icinga-wm>	 PROBLEM - Kafka Broker Server #page on kafka-jumbo1015 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[14:08:01] <wikibugs>	 10Data-Engineering, 10Structured-Data-Backlog: NEW BUG REPORT fiwiki’s section-level image suggestions aren’t generated in production - https://phabricator.wikimedia.org/T343844 (10mfossati) CC @xcollazo @JAllemandou @BTullis for your consideration.
[14:10:34] <wikibugs>	 10Data-Platform-SRE: Package kafka-kit binaries (topicmappr, metricsfetcher, ...) as a debian-package - https://phabricator.wikimedia.org/T346763 (10brouberol)
[14:11:20] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-jumbo1015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[14:12:37] <wikibugs>	 10Data-Platform-SRE: Install kafka-kit binaries on kafka brokers - https://phabricator.wikimedia.org/T346764 (10brouberol)
[14:12:48] <icinga-wm>	 RECOVERY - Check systemd state on kafka-jumbo1015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:13:02] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Reassign partitions away from kafka-jumbo100[1-6] to kafka-jumbo10[09-15] brokers - https://phabricator.wikimedia.org/T346425 (10brouberol)
[14:13:04] <wikibugs>	 10Data-Platform-SRE: Install kafka-kit binaries on kafka brokers - https://phabricator.wikimedia.org/T346764 (10brouberol)
[14:13:07] <wikibugs>	 10Data-Platform-SRE: Package kafka-kit binaries (topicmappr, metricsfetcher, ...) as a debian-package - https://phabricator.wikimedia.org/T346763 (10brouberol)
[14:13:14] <icinga-wm>	 RECOVERY - Kafka Broker Server #page on kafka-jumbo1015 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[14:13:46] <icinga-wm>	 RECOVERY - Kafka broker TLS certificate validity on kafka-jumbo1015 is OK: SSL OK - Certificate kafka-jumbo1015.eqiad.wmnet valid until 2024-09-18 13:48:00 +0000 (expires in 364 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[14:14:51] <wikibugs>	 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol)
[14:14:55] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Reassign partitions away from kafka-jumbo100[1-6] to kafka-jumbo10[09-15] brokers - https://phabricator.wikimedia.org/T346425 (10brouberol)
[14:15:03] <wikibugs>	 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol)
[14:15:07] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Reassign partitions away from kafka-jumbo100[1-6] to kafka-jumbo10[09-15] brokers - https://phabricator.wikimedia.org/T346425 (10brouberol)
[14:16:38] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad dropped message count in last 30m on alert1001 is CRITICAL: 4.442e+04 gt 1000 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[14:17:36] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:19:25] <jennifer_ebe>	 !log airflow analytics deployment with scap successful
[14:19:27] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:21:08] <wikibugs>	 (03PS9) 10Btullis: Update to Superset version 2.1.1 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/957938 (https://phabricator.wikimedia.org/T335356)
[14:29:52] <wikibugs>	 (03PS1) 10Milimetric: Send etag header on all AQS responses [analytics/aqs] - 10https://gerrit.wikimedia.org/r/958945 (https://phabricator.wikimedia.org/T342213)
[14:30:47] <wikibugs>	 10Data-Platform-SRE, 10CAS-SSO, 10Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (10Stevemunene) We successfully implemented OIDC on production datahub and auth/login seems to be working great. However there are some challenges with the user jour...
[14:32:54] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Send etag header on all AQS responses [analytics/aqs] - 10https://gerrit.wikimedia.org/r/958945 (https://phabricator.wikimedia.org/T342213) (owner: 10Milimetric)
[14:33:04] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad dropped message count in last 30m on alert1001 is OK: (C)1000 gt (W)100 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[14:35:43] <wikibugs>	 10Data-Engineering, 10Dumps 2.0, 10Data Products (Sprint 00): Develop Dumps Triage Runbook - https://phabricator.wikimedia.org/T343325 (10VirginiaPoundstone) 05Open→03Resolved
[14:37:56] <wikibugs>	 10Data-Platform-SRE: analytics/refinery deployment broken at refinery-deploy-to-hdfs - https://phabricator.wikimedia.org/T334493 (10BTullis) 05Open→03Resolved
[14:38:47] <wikibugs>	 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform, 10Patch-For-Review: EventGate occasionally fails to ingest specific schemas - https://phabricator.wikimedia.org/T326002 (10phuedx) >>! In T326002#9176238, @gmodena wrote: > == Next steps > - []...
[14:50:20] <wikibugs>	 10Data-Platform-SRE, 10Data-Catalog: DataHub rights assignment is case-sensitive - https://phabricator.wikimedia.org/T309382 (10Stevemunene) 05Open→03Resolved This was resolved by the switch to OIDC, marking it as resolved.
[15:00:27] <jinxer-wm>	 (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ...
[15:00:27] <jinxer-wm>	 mw_page_content_change_enrich in eqiad is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning
[15:07:36] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:16:16] <wikibugs>	 (03CR) 10TChin: [C: 03+2] Skip schema-deterministic-types for metrics_event schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/954097 (https://phabricator.wikimedia.org/T344511) (owner: 10TChin)
[15:16:50] <wikibugs>	 (03Merged) 10jenkins-bot: Skip schema-deterministic-types for metrics_event schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/954097 (https://phabricator.wikimedia.org/T344511) (owner: 10TChin)
[15:17:45] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:24:17] <wikibugs>	 10Data-Platform-SRE: Install kafka-kit binaries on kafka brokers - https://phabricator.wikimedia.org/T346764 (10BTullis) This might be interesting in terms of the Debian packaging: https://gitlab.wikimedia.org/repos/sre/wmf-debci It is the state of the art in terms of our CI based packaging, but I haven't tried...
[15:48:43] <wikibugs>	 10Data-Platform-SRE: Decommission wdqs100[3-5] - https://phabricator.wikimedia.org/T344198 (10Gehel)
[15:49:07] <wikibugs>	 10Data-Platform-SRE: Decommission wdqs100[3-5] - https://phabricator.wikimedia.org/T344198 (10Gehel) 05Open→03Resolved
[15:49:20] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search (Current work): Add Zookeeper config to 'rdf-streaming-updater' test service on DSE cluster - https://phabricator.wikimedia.org/T344614 (10bking)
[15:49:25] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search (Current work): Restore dse-k8s' rdf-streaming-updater from savepoint/improve bootstrapping process - https://phabricator.wikimedia.org/T345957 (10bking) 05Open→03Resolved Thanks for all your help as well. I believe this is done, but please let us know if we need to...
[16:14:42] <wikibugs>	 (03Abandoned) 10Mforns: Remove null entry from custom_data.[].value enum in monoschema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/958528 (https://phabricator.wikimedia.org/T344235) (owner: 10Mforns)
[16:17:20] <wikibugs>	 (03PS1) 10Mforns: Remove null entry from custom_data.[].value enum in monoschema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/958992 (https://phabricator.wikimedia.org/T344235)
[16:23:18] <wikibugs>	 10Data-Engineering, 10Infrastructure-Foundations, 10SRE: krb1001: krb5kdc.log excessive size - https://phabricator.wikimedia.org/T337906 (10BTullis) Disk usage hit 100% and I did this again: ` btullis@krb1001:~$ sudo truncate -s 10000 /var/log/kerberos/krb5kdc.log `  This was the size beforehand.  ` btullis@...
[16:41:01] <wikibugs>	 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10serviceops-radar, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (10bking)
[16:41:18] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (10bking) 05Open→03Resolved The patch above ensures that Data Platform SREs will be alerted if there's a problem with the flink-zk...
[16:43:04] <wikibugs>	 10Data-Engineering, 10Structured-Data-Backlog: NEW BUG REPORT fiwiki’s section-level image suggestions aren’t generated in production - https://phabricator.wikimedia.org/T343844 (10xcollazo) confirmed that the artifacts in question are the same:  The one deployed on Airflow: ` xcollazo@stat1007:/mnt/hdfs/wmf/c...
[17:10:51] <wikibugs>	 10Data-Platform-SRE, 10observability, 10Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438 (10bking) Per today's Data Platform SRE meeting, I've committed to lead this effort. To that end, I've [[ https://docs.google.com/spreadsheets/d/1hwVX_2va8pHgJJca4r8...
[17:12:36] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:13:58] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1085 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:15:24] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1085 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:16:30] <wikibugs>	 10Data-Platform-SRE, 10observability, 10Epic: Review alerting around Search update pipeline - https://phabricator.wikimedia.org/T346807 (10bking)
[17:17:36] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:23:43] <wikibugs>	 (03PS10) 10Btullis: Update to Superset version 2.1.1 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/957938 (https://phabricator.wikimedia.org/T335356)
[17:55:24] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] Add kafka-jumbo10[11-15].eqiad.wmnet to the gobblin broker list [analytics/refinery] - 10https://gerrit.wikimedia.org/r/958939 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol)
[18:42:56] <wikibugs>	 (03CR) 10Bearloga: [C: 03+2] Minor change to stream name [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/956937 (owner: 10Sharvaniharan)
[18:43:32] <wikibugs>	 (03Merged) 10jenkins-bot: Minor change to stream name [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/956937 (owner: 10Sharvaniharan)
[19:00:42] <jinxer-wm>	 (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ...
[19:00:42] <jinxer-wm>	 mw_page_content_change_enrich in eqiad is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning
[19:15:20] <wikibugs>	 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform, and 2 others: EventGate occasionally fails to ingest specific schemas - https://phabricator.wikimedia.org/T326002 (10gmodena)
[19:17:22] <wikibugs>	 10Data-Engineering, 10serviceops, 10Discovery-Search (Current work), 10Event-Platform: Improve the flink-app chart to provide more useful defaults - https://phabricator.wikimedia.org/T346315 (10RKemper)
[19:33:29] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] Add kafka-jumbo10[11-15].eqiad.wmnet to the gobblin broker list [analytics/refinery] - 10https://gerrit.wikimedia.org/r/958939 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol)
[19:33:31] <wikibugs>	 (03CR) 10Brouberol: [V: 03+2 C: 03+2] Add kafka-jumbo10[11-15].eqiad.wmnet to the gobblin broker list [analytics/refinery] - 10https://gerrit.wikimedia.org/r/958939 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol)
[20:12:32] <wikibugs>	 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 2), 10Patch-For-Review: Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10JAllemandou)
[21:17:36] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:19:14] <wikibugs>	 10Data-Platform-SRE: Write new partman recipe for cloudelastic (jbod) and update relevant Elastic config - https://phabricator.wikimedia.org/T342463 (10bking) a:03bking This is blocking T342538 , so I'm starting on it now.
[21:21:30] <wikibugs>	 10Data-Platform-SRE: Write new partman recipe for cloudelastic (jbod) and update relevant Elastic config - https://phabricator.wikimedia.org/T342463 (10bking)
[21:27:36] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:32:36] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:39:07] <wikibugs>	 10Data-Engineering, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Movement-Insights, and 2 others: Unique Devices seasonal trends on small projects - https://phabricator.wikimedia.org/T344381 (10Mayakp.wiki) @CorinnaHillebrand_WMDE , can you please confirm if this is completed? and when wa...
[22:10:03] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Event-Platform, 10Patch-For-Review: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (10bking) Once the patch above is merged, I think we'll need to do a little o...
[23:00:42] <jinxer-wm>	 (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ...
[23:00:42] <jinxer-wm>	 mw_page_content_change_enrich in eqiad is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning
[23:25:33] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (10colewhite) >>! In T341792#9179651, @gerritbot wrote: > Change 958991 **merged** by Btullis: > %%%[operations/puppet@production] Add...