[01:16:58] <jinxer-wm>	 (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 0.3368% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[01:48:27] <jinxer-wm>	 (SystemdUnitFailed) firing: (14) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:06:05] <jinxer-wm>	 (KafkaReplicationFactorTooLow) firing: ...
[02:06:05] <jinxer-wm>	 Kafka topic codfw.cpjobqueue.retry.mediawiki.job.ImageSuggestionsNotifications replication factor is too low on main-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?var-kafka_cluster=main-eqiad&var-kafka_broker=All&var-topic=codfw.cpjobqueue.retry.mediawiki.job.ImageSuggestionsNotifications&viewPanel=40 - ...
[02:06:05] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[02:11:05] <jinxer-wm>	 (KafkaReplicationFactorTooLow) resolved: ...
[02:11:05] <jinxer-wm>	 Kafka topic codfw.cpjobqueue.retry.mediawiki.job.ImageSuggestionsNotifications replication factor is too low on main-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?var-kafka_cluster=main-eqiad&var-kafka_broker=All&var-topic=codfw.cpjobqueue.retry.mediawiki.job.ImageSuggestionsNotifications&viewPanel=40 - ...
[02:11:05] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[04:28:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (15) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:16:58] <jinxer-wm>	 (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 0.3266% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[07:19:20] <wikibugs>	 (PS3) Snwachukwu: [WIP] Add Reportupdater Browser All Sites Queries. [analytics/refinery] - https://gerrit.wikimedia.org/r/995740 (https://phabricator.wikimedia.org/T354552)
[08:08:27] <jinxer-wm>	 (SystemdUnitFailed) firing: (16) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:16:59] <jinxer-wm>	 (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 0.2512% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[09:40:02] <wikibugs>	 Data-Platform-SRE (2024.01.22 - 2024.02.11), Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[10:01:50] <wikibugs>	 Data-Platform-SRE (2024.01.22 - 2024.02.11): Problem downloading large files from analytics.wikimedia.org - https://phabricator.wikimedia.org/T356792 (BTullis) Yes, the timeout happens reliably at around 65 seconds. Thanks @Vgutierrez - I'll remove the Traffic tag and focus on envoy.
[10:10:22] <wikibugs>	 Data-Platform-SRE (2024.01.22 - 2024.02.11): Problem downloading large files from analytics.wikimedia.org - https://phabricator.wikimedia.org/T356792 (BTullis) a:BTullis
[10:38:27] <jinxer-wm>	 (SystemdUnitFailed) firing: (17) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:14:20] <wikibugs>	 Data-Platform-SRE (2024.01.22 - 2024.02.11), Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[11:30:21] <wikibugs>	 Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Problem downloading large files from analytics.wikimedia.org - https://phabricator.wikimedia.org/T356792 (BTullis) Thanks again @Vgutierrez - That makes perfect sense now. As per: https://www.envoyproxy.io/docs/envoy/latest/faq/configuration/...
[11:58:04] <wikibugs>	 (PS3) Gmodena: WIP - Add webrequest schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/983898 (https://phabricator.wikimedia.org/T314956) (owner: Ottomata)
[11:58:46] <wikibugs>	 (CR) CI reject: [V: -1] WIP - Add webrequest schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/983898 (https://phabricator.wikimedia.org/T314956) (owner: Ottomata)
[12:02:38] <wikibugs>	 Data-Engineering, Data-Persistence-Backup, Infrastructure-Foundations, bacula, and 2 others: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655 (jcrespo) This is looking good- I will check in th...
[12:06:36] <wikibugs>	 (PS4) Gmodena: WIP - Add webrequest schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/983898 (https://phabricator.wikimedia.org/T314956) (owner: Ottomata)
[12:24:55] <stevemunene>	 !log restart jvm services on an-master1004 for T353776 and to pick up new JDK
[12:24:58] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:24:58] <stashbot>	 T353776: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776
[12:36:34] <stevemunene>	 !log failover hadoop namenode to an-master1004 for jvm service restart to pick up new JDK and T353776
[12:36:36] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:36:37] <stashbot>	 T353776: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776
[12:48:30] <stevemunene>	 !log restart jvm services on an-master1003 for T353776 and to pick up new JDK
[12:48:32] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:48:33] <stashbot>	 T353776: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776
[12:55:20] <wikibugs>	 Data-Engineering, Data-Persistence-Backup, Infrastructure-Foundations, bacula, netbox: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655 (Volans) Open→Resolved @BTullis thanks for t...
[13:01:37] <stevemunene>	 !log failover hadoop namenode back to an-master1003 after the jvm service restart to pick up new JDK and T353776
[13:01:39] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:01:40] <stashbot>	 T353776: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776
[13:07:30] <icinga-wm>	 PROBLEM - Check systemd state on an-master1003 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-namenode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:07:32] <icinga-wm>	 PROBLEM - Hadoop Namenode - Primary on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process
[13:08:27] <jinxer-wm>	 (SystemdUnitFailed) firing: (17) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:10:26] <icinga-wm>	 RECOVERY - Check systemd state on an-master1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:10:28] <icinga-wm>	 RECOVERY - Hadoop Namenode - Primary on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process
[13:11:40] <icinga-wm>	 RECOVERY - HDFS topology check on an-master1003 is OK: OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_topology_check
[13:13:27] <jinxer-wm>	 (SystemdUnitFailed) firing: (17) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:16:59] <jinxer-wm>	 (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 0.1815% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[13:23:27] <jinxer-wm>	 (SystemdUnitFailed) firing: (17) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:25:10] <icinga-wm>	 PROBLEM - Check systemd state on an-master1003 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-namenode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:25:10] <icinga-wm>	 PROBLEM - Hadoop Namenode - Primary on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process
[13:29:01] <wikibugs>	 Data-Engineering (Sprint 8), Patch-For-Review: [Iceberg Migration] Migrate session length tables to Iceberg - https://phabricator.wikimedia.org/T352672 (CodeReviewBot) joal opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/608  Update analytics session_length DAG
[13:29:19] <wikibugs>	 (PS43) Cyndywikime: Add analytics for Impressions, Success and Abandonment of account creation [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273)
[13:34:44] <joal>	 aqu: Hi! I need a review from you please, on patch just above --^
[13:39:27] <icinga-wm>	 RECOVERY - Hadoop Namenode - Primary on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process
[13:39:59] <icinga-wm>	 RECOVERY - Check systemd state on an-master1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:43:27] <jinxer-wm>	 (SystemdUnitFailed) firing: (17) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:47:15] <jinxer-wm>	 (HdfsFSImageAge) firing: The HDFS FSImage on analytics-hadoop:an-master1004:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[13:48:44] <wikibugs>	 Data-Platform-SRE (2024.01.22 - 2024.02.11): Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (Stevemunene) The hosts are slowly balancing in the cluster and should help  with the low capacity warnings we were getting. {F41803475}  Namenodes services have also been restarted...
[13:52:15] <jinxer-wm>	 (HdfsFSImageAge) firing: (2) The HDFS FSImage on analytics-hadoop:an-master1003:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[14:04:27] <wikibugs>	 Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Patch-For-Review: Allow federated queries with the NFDI4Culture Knowledge Graph - https://phabricator.wikimedia.org/T346455 (Loz.ross) Hi @RKemper,   Thanks so much for working on this ticket! We tried a query with my colleagues, but we're gett...
[14:13:27] <wikibugs>	 Data-Engineering: Update data_quality schemas to be compatible with Iceberg tables - https://phabricator.wikimedia.org/T356866 (JAllemandou)
[14:17:15] <jinxer-wm>	 (HdfsFSImageAge) firing: The HDFS FSImage on analytics-hadoop:an-master1004:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[14:22:15] <jinxer-wm>	 (HdfsFSImageAge) firing: (2) The HDFS FSImage on analytics-hadoop:an-master1003:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[14:28:50] <wikibugs>	 Data-Engineering, Data Products, Observability-Logging, Traffic, Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (Fabfur) Some updates about the ongoing work:  Currently our Benthos configuration produces this output, when fed with HAPr...
[14:36:31] <wikibugs>	 Data-Engineering (Sprint 8), Patch-For-Review: Add `event.app_donor_experience` fields to event sanitization allowlist - https://phabricator.wikimedia.org/T356214 (mpopov) Open→Resolved a:mpopov Thanks for merging & deploying,  @JAllemandou!  @SNowick_WMF: I just confirmed that `event_sanitiz...
[14:47:15] <jinxer-wm>	 (HdfsFSImageAge) resolved: The HDFS FSImage on analytics-hadoop:an-master1003:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[14:47:15] <jinxer-wm>	 (HdfsFSImageAge) firing: (2) The HDFS FSImage on analytics-hadoop:an-master1003:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[14:52:15] <jinxer-wm>	 (HdfsFSImageAge) resolved: (2) The HDFS FSImage on analytics-hadoop:an-master1003:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[15:01:23] <wikibugs>	 Data-Engineering, Data-Platform-SRE: [superset-k8s] Find a solution for the requestctl-generator html page - https://phabricator.wikimedia.org/T356490 (brouberol) I investigated how we could solve this, and we have 2 different ways I think we could re-implement the current behavior in k8s.  First, we cou...
[15:22:29] <wikibugs>	 (CR) Gmodena: data-quality: rename source table column (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/997485 (https://phabricator.wikimedia.org/T356628) (owner: Gmodena)
[15:32:28] <wikibugs>	 Data-Engineering: [Data Quality] Update data_quality schemas to be compatible with Iceberg tables - https://phabricator.wikimedia.org/T356866 (Ahoelzl)
[15:33:36] <wikibugs>	 Data-Engineering (Sprint 8): [Data Quality] Update data_quality schemas to be compatible with Iceberg tables - https://phabricator.wikimedia.org/T356866 (Ahoelzl)
[15:35:30] <btullis>	 !log rolling out a change of the discovery-uri to presto workers and clients https://gerrit.wikimedia.org/r/c/operations/puppet/+/998425
[15:35:32] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:41:05] <jinxer-wm>	 (KafkaReplicationFactorTooLow) firing: ...
[15:41:05] <jinxer-wm>	 Kafka topic codfw.cpjobqueue.retry.mediawiki.job.RebuildMessageGroupStatsJob replication factor is too low on main-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?var-kafka_cluster=main-eqiad&var-kafka_broker=All&var-topic=codfw.cpjobqueue.retry.mediawiki.job.RebuildMessageGroupStatsJob&viewPanel=40 - ...
[15:41:05] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[15:46:03] <brouberol>	 is there a strong reason for this ^ topic ^ to not be replicated?
[15:46:05] <jinxer-wm>	 (KafkaReplicationFactorTooLow) resolved: ...
[15:46:05] <jinxer-wm>	 Kafka topic codfw.cpjobqueue.retry.mediawiki.job.RebuildMessageGroupStatsJob replication factor is too low on main-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?var-kafka_cluster=main-eqiad&var-kafka_broker=All&var-topic=codfw.cpjobqueue.retry.mediawiki.job.RebuildMessageGroupStatsJob&viewPanel=40 - ...
[15:46:05] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[15:47:29] <brouberol>	 I guess not then
[15:47:31] <btullis>	 brouberol: I wouldn't have thought so, but then I'm not really sure. It's on the main cluster and it's something to do with the MediaWiki job queues, so it's probably a question for ServiceOps.
[15:47:59] <btullis>	 Did it fix itself, or was it a monitoring anomaly?
[15:48:19] <brouberol>	 I think that the topic was new and that it was a case of our monitoring not tolerating null series
[15:48:46] <brouberol>	 as the topic has a replication factor of 3
[15:54:57] <ottomata>	 is it possible the alert is misnamed? perhaps it is alerting on under replicated partitions?
[15:58:09] <brouberol>	 this monitor triggers when we create a topic with a replication factor = 1, it's not looking at the ISR size IIRC 
[15:58:32] <ottomata>	 weird okay
[15:59:11] <brouberol>	 I think it's a case of a topic being new and the data not "filling" the evaluation window
[15:59:25] <brouberol>	 I should probably use max instead of avg. I'll flag that for later
[15:59:33] <claime>	 hnowlan: _joe_ ^^
[15:59:43] <claime>	 possibly related to current jobqueue incident?
[16:00:15] <hnowlan>	 context: we've been seeing jobqueue errors since 14:55ish  https://logstash.wikimedia.org/goto/684a454f5135b7b7fdb695a19b0ec98d 
[16:00:17] <_joe_>	 the incident is with eventgate-main
[16:00:29] <_joe_>	 not the jobqueue, that is a consequence
[16:01:09] <_joe_>	 uhm that topic might be related to the script lucas launched?
[16:01:34] <_joe_>	 no the timing isn't right
[16:01:59] <brouberol>	 I looked in kafka, and the topic is fully replicated. I'd tend to file that under "monitoring flukes" except if there's something actively broken with it specifically
[16:03:01] <_joe_>	 brouberol: yeah, no, I think eventgate-main isn't that healthy though
[16:03:21] <_joe_>	 the service mesh is reporting connection errors since 14:55
[16:05:06] <brouberol>	 anything we can help with?
[16:06:57] <_joe_>	 brouberol: looking into eventgate-main? but right now monitoring seems dead, so I'd wait
[16:08:32] <joal>	 ping aqu on reviews please
[16:11:34] <_joe_>	 brouberol: never mind, the issues went away on our side a few minutes ago
[16:18:59] <brouberol>	 ack, 👍
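Note on the avg-versus-max remark at 15:59: the sketch below illustrates one way a newly created topic could trip the alert even though its replication factor is 3. It is not the actual alert rule. The threshold `MIN_REPLICATION_FACTOR`, the five-sample window, and the idea that missing samples are counted as 0 when averaging are all assumptions made for illustration; the log only says the monitor did not tolerate null series and used avg.

```python
from typing import Optional, Sequence

# Hypothetical threshold: fire when the aggregated replication factor is below 2.
MIN_REPLICATION_FACTOR = 2.0


def avg_over_window(samples: Sequence[Optional[float]]) -> float:
    """Average over the window, counting missing samples as 0 (assumed behaviour)."""
    return sum(s if s is not None else 0.0 for s in samples) / len(samples)


def max_over_window(samples: Sequence[Optional[float]]) -> Optional[float]:
    """Maximum over the samples that exist; None if the window is empty."""
    present = [s for s in samples if s is not None]
    return max(present) if present else None


def fires(value: Optional[float]) -> bool:
    """The alert fires only when there is a value and it is below the threshold."""
    return value is not None and value < MIN_REPLICATION_FACTOR


# A topic created near the end of the evaluation window: only the last sample exists.
new_topic = [None, None, None, None, 3.0]
# A topic that really has a replication factor of 1 for the whole window.
steady_rf1 = [1.0, 1.0, 1.0, 1.0, 1.0]

if __name__ == "__main__":
    for name, series in (("new_topic", new_topic), ("steady_rf1", steady_rf1)):
        print(name,
              "avg fires:", fires(avg_over_window(series)),
              "max fires:", fires(max_over_window(series)))
```

Under these assumptions the average for the new topic is diluted by the empty part of the window and falls below the threshold, while the maximum reflects the real value. A topic that genuinely has a replication factor of 1 fires with either aggregation.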
[17:16:59] <jinxer-wm>	 (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 0.1132% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[17:18:27] <jinxer-wm>	 (SystemdUnitFailed) firing: (18) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:26:34] <btullis>	 !log roll-restarting kafka-jumbo for T356382
[17:26:36] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:28:06] <jinxer-wm>	 (KafkaReplicationFactorTooLow) firing: (20) Kafka topic codfw.android.breadcrumbs_event replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor  - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[17:31:08] <joal>	 Ping gmodena if you can: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/998482
[17:31:16] <joal>	 otherwise I'll do it myself :)
[17:33:06] <jinxer-wm>	 (KafkaReplicationFactorTooLow) resolved: (20) Kafka topic codfw.android.breadcrumbs_event replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor  - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[17:39:35] <hnowlan>	 as regards earlier - it looks like there were issues connecting to both eventgate-main and eventgate-analytics during the window of errors https://grafana.wikimedia.org/goto/2ap2i72Iz?orgId=1 
[17:44:05] <jinxer-wm>	 (KafkaReplicationFactorTooLow) firing: (55) Kafka topic DataHubUpgradeHistory_v1 replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor  - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[17:49:07] <jinxer-wm>	 (KafkaReplicationFactorTooLow) resolved: (55) Kafka topic DataHubUpgradeHistory_v1 replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor  - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[17:50:05] <jinxer-wm>	 (KafkaReplicationFactorTooLow) firing: (17) Kafka topic android replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor  - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[17:55:07] <jinxer-wm>	 (KafkaReplicationFactorTooLow) resolved: (72) Kafka topic DataHubUpgradeHistory_v1 replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor  - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[17:56:53] <jinxer-wm>	 (KafkaReplicationFactorTooLow) firing: (52) Kafka topic DataHubUpgradeHistory_v1 replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor  - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[18:01:39] <jinxer-wm>	 (KafkaReplicationFactorTooLow) resolved: (87) Kafka topic DataHubUpgradeHistory_v1 replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor  - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[18:10:51] <wmf-insecte>	 Starting build #137 for job analytics-refinery-maven-release-docker
[18:12:53] <jinxer-wm>	 (KafkaReplicationFactorTooLow) firing: (166) Kafka topic codfw.android.product_metrics.article_toolbar_interaction replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor  - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[18:13:35] <jinxer-wm>	 (KafkaReplicationFactorTooLow) resolved: (15) Kafka topic codfw.ios.setting_action replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor  - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[18:17:57] <jinxer-wm>	 (KafkaReplicationFactorTooLow) firing: (229) Kafka topic SpecialInvestigate replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor  - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[18:18:47] <jinxer-wm>	 (KafkaReplicationFactorTooLow) resolved: (229) Kafka topic SpecialInvestigate replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor  - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[18:25:42] <wikibugs>	 Data-Engineering (Sprint 9): [Data Quality] Update data_quality schemas to be compatible with Iceberg tables - https://phabricator.wikimedia.org/T356866 (Ahoelzl)
[18:27:25] <wmf-insecte>	 Project analytics-refinery-maven-release-docker build #137: SUCCESS in 16 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/137/
[18:35:04] <wikibugs>	 Data-Platform-SRE, SRE-Access-Requests: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (Dzahn) Should we close this ticket as "invalid"? It seems the best course of action might be a new ticket like "migrate all WMDE pipelines to airflow" an...
[18:35:47] <wikibugs>	 Data-Platform-SRE, SRE-Access-Requests: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (Dzahn) Open→In progress
[18:49:07] <wikibugs>	 Data-Platform-SRE, SRE-Access-Requests: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (mpopov) Not yet. I believe @AndrewTavis_WMDE will be sharing some findings from WMDE side soon.
[18:55:29] <wmf-insecte>	 Starting build #98 for job analytics-refinery-update-jars-docker
[18:55:55] <wikibugs>	 (PS1) Maven-release-user: Add refinery-source jars for v0.2.32 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/998275
[18:55:56] <wmf-insecte>	 Project analytics-refinery-update-jars-docker build #98: SUCCESS in 26 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/98/
[19:08:28] <jinxer-wm>	 (SystemdUnitFailed) firing: (19) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:16:51] <wikibugs>	 Data-Platform-SRE, SRE, SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (mpopov) Tagging DPE SRE in case this is specific to those tools.  @cchen: Can you please verify if you can `ssh` to the stat hosts and also use Jupyte...
[19:17:43] <wikibugs>	 Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11): [Iceberg Migration] P.O.C. on Iceberg sensor using Postgres table to keep status of updates - https://phabricator.wikimedia.org/T340466 (BTullis) @mforns - Would you like us to go ahead and create this database (or these databases) for you? H...
[19:17:50] <wikibugs>	 Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11): [Iceberg Migration] P.O.C. on Iceberg sensor using Postgres table to keep status of updates - https://phabricator.wikimedia.org/T340466 (BTullis) a:BTullis
[19:18:51] <wikibugs>	 Data-Platform-SRE (2024.01.22 - 2024.02.11): Bring stat1011 into service - https://phabricator.wikimedia.org/T354526 (BTullis) Puppet is now running cleanly on stat1011, so I think that it is just a case of announcing it to the users and adding the docs to Wikitech.
[19:22:14] <wikibugs>	 Data-Platform-SRE, SRE, SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (cchen) I ran `ssh` to the stats machine and then `kinit`, and got `Password incorrect while getting initial credentials`. I also tried JupyterHub, and it also...
[19:23:28] <jinxer-wm>	 (SystemdUnitFailed) firing: (20) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:24:39] <wikibugs>	 (CR) Joal: [V: +2 C: +2] "Merging for deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/998275 (owner: Maven-release-user)
[19:26:02] <wikibugs>	 (CR) Joal: [V: +2 C: +2] "Merging for deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/997960 (https://phabricator.wikimedia.org/T347879) (owner: Joal)
[19:28:28] <jinxer-wm>	 (SystemdUnitFailed) firing: (20) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:28:52] <wikibugs>	 Data-Platform-SRE, SRE, SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (Dzahn) @cchen When you ran kinit the first time after you logged in, did it ask you to change the password? Did you get a new temporary one by mail?...
[19:36:02] <wikibugs>	 Data-Platform-SRE, SRE, SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (cchen) @Dzahn Oh, I see. I found the email and reran the kinit with the temporary password, it works now.
[19:38:33] <wikibugs>	 Data-Engineering (Sprint 8), Patch-For-Review: [BUG] webrequest analyzer DQ jobs fails to store data - https://phabricator.wikimedia.org/T356401 (CodeReviewBot) gmodena opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/609  analytics: webrequest: version bump refine...
[19:43:28] <jinxer-wm>	 (SystemdUnitFailed) firing: (20) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:49:28] <joal>	 !log Release refinery-source v0.2.32
[19:49:30] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:49:38] <joal>	 !log Deployed refinery using scap
[19:49:39] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:49:49] <joal>	 !log deploying Refinery onto HDFS
[19:49:50] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:56:26] <wikibugs>	 Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bking@cumin2002 for hosts: `cloudelastic1008.wikimedia.org` - cloudelastic1008.wikim...
[19:58:14] <wikibugs>	 Data-Engineering (Sprint 8), Patch-For-Review: [Iceberg Migration] Migrate session length tables to Iceberg - https://phabricator.wikimedia.org/T352672 (CodeReviewBot) joal merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/608  Update analytics session_length DAG
[20:01:23] <wikibugs>	 Data-Engineering (Sprint 8), Patch-For-Review: [BUG] webrequest analyzer DQ jobs fails to store data - https://phabricator.wikimedia.org/T356401 (CodeReviewBot) joal merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/609  analytics: webrequest_analyzer: version bump...
[20:08:28] <jinxer-wm>	 (SystemdUnitFailed) firing: (19) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:09:07] <joal>	 !log Relaunch druid_load_unique_devices_per_domain_daily_aggregated_monthly after deploy
[20:09:08] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:09:50] <ottomata>	 hnowlan: _joe_ eventgate-main and eventgate-analytics at the same time?  
[20:10:12] <ottomata>	 that's strange
[20:10:19] <wikibugs>	 Data-Platform-SRE, SRE, SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (Dzahn) In progress→Resolved Great! Feel free to reopen the ticket if there is anything else missing.
[20:17:38] <joal>	 Relaunch session_length_daily failed task
[20:17:42] <joal>	 !log Relaunch session_length_daily failed task
[20:17:43] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:17:48] <wikibugs>	 Data-Engineering, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 10 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (Sbailey)
[20:25:38] <wikibugs>	 Data-Engineering, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 10 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (Sbailey) Chromium render (Proton) upgraded from Node 12 to Node 18  [content transform] update to 2024-02-05-181957-pr...
[20:28:29] <jinxer-wm>	 (SystemdUnitFailed) firing: (20) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:58:28] <jinxer-wm>	 (SystemdUnitFailed) firing: (20) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:08:29] <jinxer-wm>	 (SystemdUnitFailed) firing: (20) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:16:59] <jinxer-wm>	 (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 1.572e-05% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[21:18:28] <jinxer-wm>	 (SystemdUnitFailed) firing: (20) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:28:14] <wikibugs>	 Data-Platform-SRE, SRE, SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (mpopov) @cchen: How about Superset & Hue?
[21:28:28] <jinxer-wm>	 (SystemdUnitFailed) firing: (20) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:31:10] <wikibugs>	 Data-Platform-SRE (2024.01.22 - 2024.02.11), DC-Ops, ops-eqiad: Comm Error: Backplane 0 on cloudelastic1008 - https://phabricator.wikimedia.org/T356919 (bking)
[21:42:39] <wikibugs>	 Data-Platform-SRE, SRE, SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (cchen) I'm still not able to access Superset & Hue, and I tried to reset my password again, still not working.
[21:49:08] <wikibugs>	 Data-Platform-SRE, SRE, SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (RhinosF1) You look still to be blocked on wikitech https://wikitech.wikimedia.org/wiki/Special:Contributions/Conniecc1 - not sure if that's related bu...
[21:49:12] <wikibugs>	 Data-Platform-SRE, SRE, SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (Dzahn) Resolved→Open
[21:49:22] <wikibugs>	 Data-Platform-SRE, SRE, SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (Dzahn) a:cchen→None
[21:50:53] <wikibugs>	 Data-Platform-SRE, SRE, SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (RhinosF1)
[21:52:12] <wikibugs>	 Data-Platform-SRE, SRE, SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (RhinosF1) I've added a checklist based on the private task.  @MoritzMuehlenhoff (or another SRE): please update based on what already works  @cchen: i...
[23:04:53] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[23:08:28] <jinxer-wm>	 (SystemdUnitFailed) firing: (19) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:23:29] <jinxer-wm>	 (SystemdUnitFailed) firing: (20) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:59:31] <wikibugs>	 Data-Platform-SRE (2024.01.22 - 2024.02.11): Stale data/failed queries on wikidatawiki index - https://phabricator.wikimedia.org/T356941 (bking)