[00:18:40] <icinga-wm>	 PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:19:10] <jinxer-wm>	 (SystemdUnitFailed) firing: (7) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:31:14] <icinga-wm>	 RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:03] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Restart Search Platform-owned services for Java 8 / Java 11 security updates - https://phabricator.wikimedia.org/T350703 (10RKemper) @MoritzMuehlenhoff Oops it appears we made the same mistake twice :P Can you do one more check for us? I think everything is all set n...
[00:34:10] <jinxer-wm>	 (SystemdUnitFailed) firing: (7) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:39:10] <jinxer-wm>	 (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:49:10] <jinxer-wm>	 (SystemdUnitFailed) firing: (9) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:15:26] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:19:10] <jinxer-wm>	 (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:42:15] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on snapshot1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:59:15] <jinxer-wm>	 (SystemdUnitFailed) firing: (8) wmf_auto_restart_prometheus-mysqld-exporter@s1.service Failed on dbstore1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:04:10] <jinxer-wm>	 (SystemdUnitFailed) firing: (9) wmf_auto_restart_prometheus-mysqld-exporter@s1.service Failed on dbstore1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:42:15] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on snapshot1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:04:10] <jinxer-wm>	 (SystemdUnitFailed) firing: (9) wmf_auto_restart_prometheus-mysqld-exporter@s1.service Failed on dbstore1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:04:21] <wikibugs>	 10Analytics, 10Data-Engineering, 10Data-Engineering-Wikistats, 10Data Products (Data Products Sprint 05): Wikistats - incorrect number of content articles for Latvian Wikipedia - https://phabricator.wikimedia.org/T354074 (10Spnq) I reviewed the data available to me and realised that I had previously misrea...
[07:30:24] <wikibugs>	 (03PS3) 10Cyndywikime: Add UserIsTemp property [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/989205 (https://phabricator.wikimedia.org/T300273)
[07:30:56] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add UserIsTemp property [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/989205 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime)
[08:17:36] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Ensure Elastic stack works on bookworm - https://phabricator.wikimedia.org/T353392 (10MoritzMuehlenhoff) >>! In T353392#9444003, @bking wrote: > Per today's pairing session with @RKemper , it looks like our current version of Elastic (7.10.2) do...
[08:41:27] <wikibugs>	 10Analytics, 10Data-Engineering, 10Data-Engineering-Wikistats, 10Data Products (Data Products Sprint 05): Wikistats - incorrect number of content articles for Latvian Wikipedia - https://phabricator.wikimedia.org/T354074 (10Sfaci) Do you mean we don't need the last patch?   Which should be the right value...
[08:48:15] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Wikistats, 10Patch-For-Review: Incorrect number of content pages on stats.wikimedia.org - https://phabricator.wikimedia.org/T354489 (10Sfaci) We are still working on a secondary bug related to the first one. You can catch up in the related ticket {T354074}.  The new pat...
[09:37:03] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): analytics/refinery scap deploy on test cluster fails with permission error - https://phabricator.wikimedia.org/T354703 (10Gehel) p:05Triage→03High
[09:37:07] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): analytics/refinery scap deploy on test cluster fails with permission error - https://phabricator.wikimedia.org/T354703 (10Gehel)
[09:37:10] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Create 3 microsites for wdqs full graph, main graph, & scholarly articles - https://phabricator.wikimedia.org/T354658 (10Gehel) p:05Triage→03High
[09:37:17] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Create 3 microsites for wdqs full graph, main graph, & scholarly articles - https://phabricator.wikimedia.org/T354658 (10Gehel)
[09:37:23] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: [NEEDS GROOMING] schema services should be moved to k8s - https://phabricator.wikimedia.org/T347421 (10Gehel) p:05Triage→03Low
[09:42:15] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on snapshot1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:42:38] <wikibugs>	 10Data-Engineering, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Data Products (Data Products Sprint 07), and 3 others: EventLoggingTest::testDispatch fails when time ticks within the test run - https://phabricator.wikimedia.org/T353243 (10Sfaci) a:03Sfaci
[09:47:37] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE: Create Conda Analytics environment for new Spark version - https://phabricator.wikimedia.org/T354733 (10Gehel)
[09:49:27] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE: Validating that current jobs are running correctly with the new Conda Analytics environment - https://phabricator.wikimedia.org/T354735 (10Gehel)
[09:50:29] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE: Deploy new Spark version on production - https://phabricator.wikimedia.org/T354737 (10Gehel)
[09:50:46] <wikibugs>	 10Data-Platform-SRE: Create Conda Analytics environment for new Spark version - https://phabricator.wikimedia.org/T354733 (10Gehel)
[09:51:11] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE: Deploy new Spark version on production - https://phabricator.wikimedia.org/T354737 (10Gehel) p:05Triage→03Medium
[09:51:15] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE: Deploy new Spark version on production - https://phabricator.wikimedia.org/T354737 (10Gehel) p:05Medium→03High
[09:51:24] <wikibugs>	 10Data-Platform-SRE: Create Conda Analytics environment for new Spark version - https://phabricator.wikimedia.org/T354733 (10Gehel) p:05Triage→03High
[09:51:27] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE: Validating that current jobs are running correctly with the new Conda Analytics environment - https://phabricator.wikimedia.org/T354735 (10Gehel) p:05Triage→03High
[09:53:12] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE: Deploy new Spark version on production - https://phabricator.wikimedia.org/T354737 (10Gehel) p:05High→03Medium
[09:53:14] <wikibugs>	 10Data-Platform-SRE: Create Conda Analytics environment for new Spark version - https://phabricator.wikimedia.org/T354733 (10Gehel) p:05High→03Medium
[09:53:16] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE: Validating that current jobs are running correctly with the new Conda Analytics environment - https://phabricator.wikimedia.org/T354735 (10Gehel) p:05High→03Medium
[10:04:10] <jinxer-wm>	 (SystemdUnitFailed) firing: (9) wmf_auto_restart_prometheus-mysqld-exporter@s1.service Failed on dbstore1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:25:00] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10decommission-hardware: decommission druid1004.eqiad.wmnet - https://phabricator.wikimedia.org/T354741 (10Stevemunene)
[10:25:59] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10decommission-hardware: decommission druid1005.eqiad.wmnet - https://phabricator.wikimedia.org/T354742 (10Stevemunene)
[10:26:52] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10decommission-hardware: decommission druid1006.eqiad.wmnet - https://phabricator.wikimedia.org/T354743 (10Stevemunene)
[10:27:28] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10decommission-hardware: decommission druid1006.eqiad.wmnet - https://phabricator.wikimedia.org/T354743 (10Stevemunene)
[10:27:30] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (10Stevemunene)
[10:27:45] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10decommission-hardware: decommission druid1005.eqiad.wmnet - https://phabricator.wikimedia.org/T354742 (10Stevemunene)
[10:27:47] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (10Stevemunene)
[10:28:09] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10decommission-hardware: decommission druid1004.eqiad.wmnet - https://phabricator.wikimedia.org/T354741 (10Stevemunene)
[10:28:11] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (10Stevemunene)
[11:29:23] <btullis>	 FYI, I am about to merge this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/989213/ which will set up an-master1003 as a hadoop-master-in-waiting for T332573
[11:29:24] <stashbot>	 T332573: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573
[11:30:11] <wikibugs>	 10Data-Engineering, 10Release-Engineering-Team, 10GitLab (CI & Job Runners): Unblock Dockerfile syntax to build images with Gitlab trusted runner - https://phabricator.wikimedia.org/T351792 (10gmodena) Thanks for clarifying @Antoine_Quhen .   >>! In T351792#9446831, @Antoine_Quhen wrote: > 1/ Splitting the C...
[11:34:13] <wikibugs>	 (03PS4) 10Cyndywikime: Add UserIsTemp property [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/989205 (https://phabricator.wikimedia.org/T300273)
[11:34:39] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add UserIsTemp property [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/989205 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime)
[11:39:05] <stevemunene>	 !log decommission druid1004.eqiad.wmnet T354741
[11:39:08] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:39:09] <stashbot>	 T354741: decommission druid1004.eqiad.wmnet - https://phabricator.wikimedia.org/T354741
[11:39:10] <jinxer-wm>	 (SystemdUnitFailed) firing: (10) hadoop-hdfs-namenode.service Failed on an-master1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:40:19] <jinxer-wm>	 (SystemdUnitFailed) firing: (11) hadoop-hdfs-namenode.service Failed on an-master1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:44:41] <wikibugs>	 (03PS5) 10Cyndywikime: Add UserIsTemp property [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/989205 (https://phabricator.wikimedia.org/T300273)
[11:45:11] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add UserIsTemp property [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/989205 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime)
[11:56:06] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10decommission-hardware: decommission druid1004.eqiad.wmnet - https://phabricator.wikimedia.org/T354741 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by stevemunene@cumin1002 for hosts: `druid1004.eqiad.wmnet` - druid1004.eqiad.wmnet (**PASS**)...
[11:57:28] <wikibugs>	 (03PS1) 10Cyndywikime: Add UserIsTemp property [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/989487 (https://phabricator.wikimedia.org/T300273)
[11:58:06] <wikibugs>	 10Data-Engineering, 10Release-Engineering-Team, 10GitLab (CI & Job Runners): Unblock Dockerfile syntax to build images with Gitlab trusted runner - https://phabricator.wikimedia.org/T351792 (10BTullis) >>! In T351792#9449781, @gmodena wrote: > Thanks for clarifying @Antoine_Quhen . >> I have an alternative i...
[11:58:44] <wikibugs>	 (03Abandoned) 10Cyndywikime: Add UserIsTemp property [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/989205 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime)
[12:05:51] <stevemunene>	 !log decommission druid1005.eqiad.wmnet T354742
[12:05:53] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:05:54] <stashbot>	 T354742: decommission druid1005.eqiad.wmnet - https://phabricator.wikimedia.org/T354742
[12:21:31] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10decommission-hardware: decommission druid1005.eqiad.wmnet - https://phabricator.wikimedia.org/T354742 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by stevemunene@cumin1002 for hosts: `druid1005.eqiad.wmnet` - druid1005.eqiad.wmnet (**PASS**)...
[12:22:37] <stevemunene>	 !log decommission druid1006.eqiad.wmnet T354743
[12:22:39] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:22:40] <stashbot>	 T354743: decommission druid1006.eqiad.wmnet - https://phabricator.wikimedia.org/T354743
[12:39:03] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10decommission-hardware: decommission druid1006.eqiad.wmnet - https://phabricator.wikimedia.org/T354743 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by stevemunene@cumin1002 for hosts: `druid1006.eqiad.wmnet` - druid1006.eqiad.wmnet (**PASS**)...
[12:40:51] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (10Stevemunene)
[12:47:30] <stevemunene>	 !log roll restarting hadoop test workers to pick up new JRE
[12:47:32] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:09:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on an-master1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:23:06] <jinxer-wm>	 (KafkaReplicationFactorTooLow) firing: (860) Kafka topic PlatformEvent_v1 replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor  - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[13:28:43] <jinxer-wm>	 (KafkaReplicationFactorTooLow) resolved: (860) Kafka topic PlatformEvent_v1 replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor  - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[13:39:11] <jinxer-wm>	 (SystemdUnitFailed) firing: (11) hadoop-hdfs-namenode.service Failed on an-master1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:40:08] <btullis>	 ^ this can be ignored. I thought I had downtimed it, whilst an-master1003 is being brought into service.
[13:42:15] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on snapshot1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:45:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[14:06:48] <wikibugs>	 10Data-Engineering (Sprint 7): [Maintenance] Migrate ReportUpdater browser queries to Airflow - https://phabricator.wikimedia.org/T354552 (10Snwachukwu) a:03Snwachukwu
[14:42:26] <wikibugs>	 10Data-Engineering, 10MediaWiki-General, 10Event-Platform, 10Patch-For-Review: Create legacy EventLogging proxy HTTP intake (for MediaWikiPingback) endpoint to EventGate - https://phabricator.wikimedia.org/T353817 (10Ottomata) Oh, and actually, we only need to count requests to mediawiki.org/beacon/event,...
[14:49:10] <jinxer-wm>	 (SystemdUnitFailed) firing: (11) hadoop-yarn-resourcemanager.service Failed on an-master1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:49:59] <jinxer-wm>	 (PuppetFailure) firing: (2) Puppet has failed on an-master1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:02:18] <wikibugs>	 (03PS2) 10Snwachukwu: Migration of browser General table to iceberg format. 1. Add an iceberg create table statement hql file for browser_general table 2. Add hql file to update browser_general iceberg table with values. [analytics/refinery] - 10https://gerrit.wikimedia.org/r/988711 (https://phabricator.wikimedia.org/T352670)
[15:04:47] <wikibugs>	 (03CR) 10Snwachukwu: Migration of browser General table to iceberg format. 1. Add an iceberg create table statement hql file for browser_general table 2. Add hql  (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/988711 (https://phabricator.wikimedia.org/T352670) (owner: 10Snwachukwu)
[15:09:10] <jinxer-wm>	 (SystemdUnitFailed) firing: (12) hadoop-mapreduce-historyserver.service Failed on an-master1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:11:39] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10BTullis) This is going quite well, so far.  I have brought up an-master1003 in the `analytics_cluster::hadoop::master` role and bootstrapped the namenod...
[15:13:24] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10BTullis)
[15:13:30] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e7c141a8-f6e4-4385-b205-cac59bd12d90) set by btullis@cumin1002 for 7 days, 0:00:00 o...
[15:43:51] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10CirrusSearch, 10Discovery-Search: SUP: Production TODOs - https://phabricator.wikimedia.org/T354595 (10bking) 05Open→03In progress
[15:52:45] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Restart Search Platform-owned services for Java 8 / Java 11 security updates - https://phabricator.wikimedia.org/T350703 (10MoritzMuehlenhoff) Looks all good now, thanks
[16:06:55] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10CirrusSearch, 10Discovery-Search, 10SRE, 10serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (10pfischer) For reference, here's a screenshot of more kafka metrics around enabling compaction:  {F41...
[16:15:08] <wikibugs>	 10Data-Engineering (Sprint 7), 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Investigate Spark History Server silent errors when downloading some files from HDFS - https://phabricator.wikimedia.org/T354777 (10brouberol)
[16:17:24] <wikibugs>	 10Data-Engineering (Sprint 7), 10Data-Platform-SRE, 10Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10brouberol)
[16:19:14] <wikibugs>	 10Data-Platform-SRE (2023/24 Q2 Milestone 1): Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671 (10VRiley-WMF)
[16:20:27] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[16:38:48] <wikibugs>	 10Data-Engineering, 10Data Products (Data Products Sprint 05): Make defaults immutable for Airflow confs - https://phabricator.wikimedia.org/T325014 (10xcollazo) This was deployed today and it is looking good in prod.
[16:38:54] <wikibugs>	 (03PS1) 10Mforns: Add query to load MediaWiki snapshot to Cassandra AQS config table [analytics/refinery] - 10https://gerrit.wikimedia.org/r/989558 (https://phabricator.wikimedia.org/T352948)
[16:42:23] <wikibugs>	 10Data-Engineering (Sprint 7), 10Data-Platform-SRE: Pin docker images to an explicit tag instead of using latest - https://phabricator.wikimedia.org/T354785 (10brouberol)
[16:42:36] <wikibugs>	 10Data-Engineering (Sprint 7), 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Pin docker images to an explicit tag instead of using latest - https://phabricator.wikimedia.org/T354785 (10brouberol)
[16:47:32] <wikibugs>	 (03PS2) 10Mforns: Add query to load MediaWiki snapshot to Cassandra AQS config table [analytics/refinery] - 10https://gerrit.wikimedia.org/r/989558 (https://phabricator.wikimedia.org/T352948)
[17:07:33] <wikibugs>	 10Analytics, 10Data-Engineering, 10Data-Engineering-Wikistats, 10Data Products (Data Products Sprint 07): Wikistats - incorrect number of content articles for Latvian Wikipedia - https://phabricator.wikimedia.org/T354074 (10WDoranWMF)
[17:20:40] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Ensure Elastic stack works on bookworm - https://phabricator.wikimedia.org/T353392 (10bking) Thanks for the update! I'll set this to blocked. No rush or anything, but do reach out when the packages are ready or if we can do anything to help.
[17:22:17] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10CirrusSearch, 10Discovery-Search: SUP: Production TODOs - https://phabricator.wikimedia.org/T354595 (10bking)
[17:29:18] <wikibugs>	 10Data-Engineering (Sprint 7): [Refine System] Define a concept and an approach for refactoring the Refine system - https://phabricator.wikimedia.org/T354696 (10Ahoelzl) a:03JAllemandou
[17:29:31] <wikibugs>	 10Data-Engineering (Sprint 7): [Maintenance] Safeguard VarnishKafka to HAProxy analytics transition - https://phabricator.wikimedia.org/T354694 (10Ahoelzl) a:03gmodena
[17:30:02] <wikibugs>	 10Data-Engineering (Sprint 7): [Data Quality] Implement basic data quality metrics for MW history - https://phabricator.wikimedia.org/T354692 (10Ahoelzl) a:03Antoine_Quhen
[17:30:31] <wikibugs>	 10Data-Engineering (Sprint 7): [Iceberg Migration] Define sensor concept - https://phabricator.wikimedia.org/T354695 (10Ahoelzl) a:03Antoine_Quhen
[17:46:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on snapshot1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:06:41] <wikibugs>	 10Data-Engineering, 10Movement-Insights, 10Traffic, 10Patch-For-Review: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (10dr0ptp4kt)
[18:07:47] <wikibugs>	 10Quarry: build container on PR - https://phabricator.wikimedia.org/T316958 (10rook) 05Open→03Resolved
[18:07:57] <wikibugs>	 10Quarry, 10GitLab (Project Migration): Move Quarry from Gerrit to GitHub - https://phabricator.wikimedia.org/T308978 (10rook)
[18:09:28] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10BTullis)
[18:17:31] <wikibugs>	 10Quarry: Add maintainers to quarry - https://phabricator.wikimedia.org/T348184 (10rook) 05In progress→03Resolved
[18:37:53] <wikibugs>	 10Data-Engineering (Sprint 7), 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Pin docker images to an explicit tag instead of using latest - https://phabricator.wikimedia.org/T354785 (10brouberol) 05Open→03Resolved
[18:37:58] <wikibugs>	 10Data-Engineering (Sprint 7), 10Data-Platform-SRE, 10Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10brouberol)
[18:39:44] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10CirrusSearch, 10Discovery-Search, 10serviceops: Requesting permission to enable kafka log compaction for page_rerender on kafka-main - https://phabricator.wikimedia.org/T354794 (10bking)
[18:44:10] <wikibugs>	 (03CR) 10Sergio Gimeno: Add analytics for Impressions, Success and Abandonment of account creation (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime)
[18:49:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on an-test-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:52:48] <wikibugs>	 10Data-Engineering, 10Movement-Insights, 10Traffic, 10Patch-For-Review: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (10dr0ptp4kt) I'm scheduling time with @Mayakp.wiki and @MGerlach to soon discuss potential future use cases, but if folks familiar with...
[18:58:23] <wikibugs>	 10Data-Engineering, 10Movement-Insights, 10Traffic, 10Patch-For-Review: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (10dr0ptp4kt)
[19:04:10] <jinxer-wm>	 (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:05:19] <jinxer-wm>	 (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:16:02] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10CirrusSearch, 10Discovery-Search, 10serviceops: Requesting permission to enable kafka log compaction for page_rerender on kafka-main - https://phabricator.wikimedia.org/T354794 (10bking)
[19:25:07] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10CirrusSearch, 10Discovery-Search: SUP: Production TODOs - https://phabricator.wikimedia.org/T354595 (10bking)
[21:47:16] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on snapshot1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[22:03:20] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10CirrusSearch, 10Discovery-Search: SUP: Production TODOs - https://phabricator.wikimedia.org/T354595 (10bking)
[22:49:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on an-test-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:05:19] <jinxer-wm>	 (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed