[00:30:35] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:19] (SystemdUnitFailed) firing: (9) user-runtime-dir@24065.service Failed on an-test-client1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:30:36] (SystemdUnitFailed) firing: (9) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:32:13] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:30:36] (SystemdUnitFailed) firing: (9) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:47:44] Data-Engineering, Data-Platform-SRE, Dumps-Generation, SRE, Patch-For-Review: Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228 (LSobanski)
[07:34:30] Data-Engineering (Sprint 7), Data-Platform-SRE (2023/24 Q3 Milestone 1): Users in archiva-deployer group can't upload artifacts anymore. - https://phabricator.wikimedia.org/T355352 (brouberol) I've `ssh` -ed onto `archiva1002.wikimedia.org` and had a look at the archiva logs (`/var/log/archiva/ I'm seei...
[07:35:12] Data-Engineering (Sprint 7), Data-Platform-SRE (2023/24 Q3 Milestone 1): Users in archiva-deployer group can't upload artifacts anymore. - https://phabricator.wikimedia.org/T355352 (brouberol) I'm also seeing the following traceback in `/var/log/archiva/wrapper.log` ` INFO | jvm 1 | 2024/01/22 02:3...
[07:38:25] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 (ayounsi) We got a diffscan alert as those servers are running on public IPs and new ports are exposed to the diffscan cloudVM. After a quick look it seems like those serv...
[07:39:56] Data-Engineering (Sprint 7), Data-Platform-SRE (2023/24 Q3 Milestone 1): Users in archiva-deployer group can't upload artifacts anymore. - https://phabricator.wikimedia.org/T355352 (brouberol) What I'm seeing points to an issue with `git-fat`, and I'll continue investigating after baby-duty.
[07:42:16] Data-Engineering (Sprint 7), Data-Platform-SRE (2023/24 Q3 Milestone 1): Users in archiva-deployer group can't upload artifacts anymore. - https://phabricator.wikimedia.org/T355352 (brouberol) Ah wait, no, `git-fat` is a oneshot service, so it's *supposed* to start regularly ` brouberol@archiva1002:~$ su...
[07:55:36] (SystemdUnitFailed) firing: (11) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:57:43] Data-Engineering: Check home/HDFS leftovers of shubhankar - https://phabricator.wikimedia.org/T355501 (MoritzMuehlenhoff)
[07:59:19] (SystemdUnitFailed) firing: (12) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:36:19] Hello, SREs! I wanted to check the alert about "hadoop-namenode-backup-hdfs". But I don't have access to an-master1002 anymore (Permission denied). It's been a long time since I have accessed this machine, and I usually don't need it. But I think it could be handy in case of emergency or debugging. Like any other data engineer. What do you think?
[08:49:29] (PS9) TChin: Add iceberg version of aqs_hourly table [analytics/refinery] - https://gerrit.wikimedia.org/r/982869 (https://phabricator.wikimedia.org/T352669)
[08:53:12] (PS3) TChin: Add iceberg version of interlanguage_navigation table [analytics/refinery] - https://gerrit.wikimedia.org/r/986839 (https://phabricator.wikimedia.org/T352671)
[09:03:36] Data-Engineering, DBA, Data-Services, TaxonBot, and 2 others: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (taavi)
[09:04:26] Data-Engineering, Data-Persistence, Data-Services, cloud-services-team, Patch-For-Review: Wiki-replicas: investigate why some maintenance operations can cause unwanted pybal impact - https://phabricator.wikimedia.org/T337721 (taavi) Open→Invalid Pybal is no longer used here.
[09:23:54] Data-Engineering, MediaWiki-extensions-EventLogging, Metrics Platform Backlog, Data Products (Data Products Sprint 07): Add and export MetricsClient#isStreamInSample() - https://phabricator.wikimedia.org/T352966 (SGupta-WMF) Related gitlab MR https://gitlab.wikimedia.org/repos/data-engineering/me...
[09:24:27] aqu: Apologies, that alert can be ignored. I was supposed to disable the check last week. an-master1002 has been made ready for decommissioning and replaced with an-master1004 under T332573
[09:24:28] T332573: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573
[09:25:27] Ohhh I get it. Thanks for the explanation.
[09:26:22] Sorry about leaving the check. jo.al pointed it out to me last week too on the data-engineering-alerts list, but I simply forgot to go back and manually remove the check.
[09:26:39] I'll be decommissioning an-master100[1-2] this week, all being well.
[09:42:35] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (BTullis) We have been getting some emails from systemd timers that were inadvertently left enabled on an-master1002. I have disabled them with: ` btu...
[09:46:50] Data-Engineering, MediaWiki-extensions-EventLogging, Metrics Platform Backlog, Data Products (Data Products Sprint 07): Add and export MetricsClient#isStreamInSample() - https://phabricator.wikimedia.org/T352966 (phuedx) @SGupta-WMF: Gerrit is back up :)
[09:52:15] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[10:00:50] Data-Engineering (Sprint 7), Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Users in archiva-deployer group can't upload artifacts anymore. - https://phabricator.wikimedia.org/T355352 (brouberol) All my previous comments related to git-fat were invalid. When digging in the Archiva UI,...
[10:02:25] Data-Engineering (Sprint 7): [Iceberg Migration] Define sensor concept and implementation plan - https://phabricator.wikimedia.org/T354695 (Antoine_Quhen) @BTullis , @brouberol , @Stevemunene I would like your feedback on this subject: We are going to setup a new system `status-store` which would describe e...
[10:02:43] Data-Engineering, Release-Engineering-Team, Gerrit (Gerrit 3.7), Upstream: git clone and git pull commands fail for refinery repo - https://phabricator.wikimedia.org/T355173 (hashar) Stalled→Resolved a:hashar I went to leave it as is over the week-end knowing upgrading Gerrit was on t...
[10:13:38] (CR) Joal: [WIP] Rewrite x-analytics header hash parsing (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/991805 (owner: Aqu)
[10:46:01] Data-Engineering (Sprint 7), Data-Platform-SRE (2023/24 Q3 Milestone 1): Users in archiva-deployer group can't upload artifacts anymore. - https://phabricator.wikimedia.org/T355352 (brouberol) a:brouberol
[11:08:35] (DiskSpace) firing: Disk space an-worker1113:9100:/var/lib/hadoop/data/b 5.967% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-worker1113 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[11:10:36] (SystemdUnitFailed) firing: (21) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:14:51] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on an-master1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[13:28:40] Data-Engineering, Release-Engineering-Team, Gerrit (Gerrit 3.7), Upstream: git clone and git pull commands fail for refinery repo - https://phabricator.wikimedia.org/T355173 (gmodena) >>! In T355173#9475961, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=ht...
[13:40:36] Data-Platform-SRE ( 2023/24 Q3 Milestone 2), CirrusSearch, Discovery-Search, serviceops: Requesting permission to enable kafka log compaction for page_rerender on kafka-main - https://phabricator.wikimedia.org/T354794 (Gehel)
[13:40:40] Data-Platform-SRE ( 2023/24 Q3 Milestone 2), Discovery-Search (Current work): Cirrus-streaming-updater test: validate relforge indices are correctly updated - https://phabricator.wikimedia.org/T350186 (Gehel)
[13:40:44] Data-Platform-SRE ( 2023/24 Q3 Milestone 2), Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Improve how we address outside k8s infrastructure from within charts (e.g. network policies) - https://phabricator.wikimedia.org/T331894 (Gehel)
[13:41:29] Data-Platform-SRE ( 2023/24 Q3 Milestone 2), VPS-project-Codesearch, Patch-For-Review: Add all Data Engineering gitlab repositories to codesearch - https://phabricator.wikimedia.org/T355069 (Gehel)
[13:41:32] Data-Platform-SRE ( 2023/24 Q3 Milestone 2), Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Create 3 microsites for wdqs full graph, main graph, & scholarly articles - https://phabricator.wikimedia.org/T354658 (Gehel)
[13:41:47] Data-Platform-SRE ( 2023/24 Q3 Milestone 2), Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (Gehel)
[13:42:28] Data-Platform-SRE ( 2023/24 Q3 Milestone 2), Wikidata, Discovery-Search (Current work): Generate TLS certs for new WDQS endpoints - https://phabricator.wikimedia.org/T354661 (Gehel)
[13:42:35] Data-Platform-SRE ( 2023/24 Q3 Milestone 2), Patch-For-Review: Create helmfile deployment files for superset and superset-next - https://phabricator.wikimedia.org/T353790 (Gehel)
[13:42:37] Data-Platform-SRE ( 2023/24 Q3 Milestone 2), observability: Change data platform-related IRC channels to improve communication - https://phabricator.wikimedia.org/T352783 (Gehel)
[13:42:40] Data-Platform-SRE ( 2023/24 Q3 Milestone 2), Discovery-Search (Current work): Create dashboards for Search SLOs - https://phabricator.wikimedia.org/T338009 (Gehel)
[13:42:42] Data-Platform-SRE ( 2023/24 Q3 Milestone 2), Patch-For-Review: Create a helm chart for Superset - https://phabricator.wikimedia.org/T352166 (Gehel)
[13:42:44] Data-Platform-SRE ( 2023/24 Q3 Milestone 2): Refactor sre.wdqs.data-transfer to use new spicerack class api - https://phabricator.wikimedia.org/T347624 (Gehel)
[13:42:46] Data-Platform-SRE ( 2023/24 Q3 Milestone 2), Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Expose 3 new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650 (Gehel)
[13:42:49] Data-Platform-SRE ( 2023/24 Q3 Milestone 2), Patch-For-Review: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045 (Gehel)
[13:42:51] Data-Platform-SRE ( 2023/24 Q3 Milestone 2), Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (Gehel)
[13:42:53] Data-Platform-SRE ( 2023/24 Q3 Milestone 2): Check log rotation settings on airflow instances - https://phabricator.wikimedia.org/T339015 (Gehel)
[13:43:34] (DiskSpace) resolved: Disk space an-worker1113:9100:/var/lib/hadoop/data/b 5.964% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-worker1113 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[13:44:13] Data-Platform-SRE, GitLab (Project Migration): Migrate Elasticsearch plugins repo to gitlab - https://phabricator.wikimedia.org/T353275 (Gehel)
[13:45:33] Data-Platform-SRE ( 2023/24 Q3 Milestone 2), CirrusSearch, Discovery-Search (Current work): SUP: Production TODOs - https://phabricator.wikimedia.org/T354595 (Gehel)
[13:45:45] Data-Engineering, Data-Platform-SRE ( 2023/24 Q3 Milestone 2): analytics/refinery scap deploy on test cluster fails with permission error - https://phabricator.wikimedia.org/T354703 (Gehel)
[13:46:06] Data-Platform-SRE ( 2023/24 Q3 Milestone 2), Wikidata, Discovery-Search (Current work): Create DNS records for 3 new WDQS endpoints - https://phabricator.wikimedia.org/T354662 (Gehel)
[13:46:37] Data-Platform-SRE ( 2023/24 Q3 Milestone 2), Discovery-Search (Current work): Investigate connection timeouts between Search Update Pipeline and MediaWiki APIs - https://phabricator.wikimedia.org/T354289 (Gehel)
[13:46:53] Data-Platform-SRE ( 2023/24 Q3 Milestone 2), Discovery-Search (Current work): Investigate connection timeouts between Search Update Pipeline and MediaWiki APIs - https://phabricator.wikimedia.org/T354289 (Gehel) p:Triage→High
[13:47:54] Data-Engineering (Sprint 7), Data-Platform-SRE ( 2023/24 Q3 Milestone 2): Users in archiva-deployer group can't upload artifacts anymore. - https://phabricator.wikimedia.org/T355352 (Gehel)
[13:50:24] Data-Engineering (Sprint 7), Data-Platform-SRE: [Iceberg Migration] Define sensor concept and implementation plan - https://phabricator.wikimedia.org/T354695 (Gehel)
[13:51:50] Data-Engineering (Sprint 7), Patch-For-Review: [Iceberg Migration] Migrate interlanguage tables to Iceberg - https://phabricator.wikimedia.org/T352671 (tchin) Using `lz4` compression works but checking it with `parquet-tools` doesn't. I see something like `compression: UNKNOWN (space_saved: -25%)` [[ htt...
[13:52:15] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[13:53:04] Data-Engineering (Sprint 7), Data-Platform-SRE: [Iceberg Migration] Define sensor concept and implementation plan - https://phabricator.wikimedia.org/T354695 (BTullis) Hi @Antoine_Quhen - we would be happy to support this. We have a couple of shared database platforms that we use to provide state storag...
[14:02:28] Data-Engineering, Data-Platform-SRE: Check home/HDFS leftovers of shubhankar - https://phabricator.wikimedia.org/T355501 (lbowmaker)
[14:03:18] Data-Engineering, Data-Platform-SRE: Check home/HDFS leftovers of shubhankar - https://phabricator.wikimedia.org/T355501 (Gehel) p:Triage→Low
[14:06:19] Data-Engineering (Sprint 7), Data-Platform-SRE ( 2023/24 Q3 Milestone 2): Users in archiva-deployer group can't upload artifacts anymore. - https://phabricator.wikimedia.org/T355352 (brouberol) It turns out the issue was in LDAP/Archiva. The `archiva-deployers` had an extra `cn` attribute with value `cn=...
[14:06:36] Data-Engineering (Sprint 7), Data-Platform-SRE ( 2023/24 Q3 Milestone 2): Users in archiva-deployer group can't upload artifacts anymore. - https://phabricator.wikimedia.org/T355352 (brouberol) Open→Resolved
[14:12:31] Data-Engineering: [Dataset Config Store] - Proof of Concept - https://phabricator.wikimedia.org/T355542 (lbowmaker)
[14:14:34] (DiskSpace) firing: Disk space an-worker1113:9100:/var/lib/hadoop/data/b 5.999% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-worker1113 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[14:27:08] Data-Platform-SRE (2024.01.22 - 2024.02.11), CirrusSearch, Discovery-Search, serviceops: Requesting permission to enable kafka log compaction for page_rerender on kafka-main - https://phabricator.wikimedia.org/T354794 (Gehel) p:Triage→High
[14:27:26] Data-Platform-SRE (2024.01.22 - 2024.02.11), Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Improve how we address outside k8s infrastructure from within charts (e.g. network policies) - https://phabricator.wikimedia.org/T331894 (Gehel) p:Triage→Medium
[14:27:44] Data-Platform-SRE (2024.01.22 - 2024.02.11), VPS-project-Codesearch, Patch-For-Review: Add all Data Engineering gitlab repositories to codesearch - https://phabricator.wikimedia.org/T355069 (Gehel) p:Triage→Low
[14:28:43] Data-Platform-SRE (2024.01.22 - 2024.02.11), Wikidata, Discovery-Search (Current work): Generate TLS certs for new WDQS endpoints - https://phabricator.wikimedia.org/T354661 (Gehel) p:Triage→High
[14:29:13] Data-Platform-SRE (2024.01.22 - 2024.02.11), Wikidata, Discovery-Search (Current work): Create DNS records for 3 new WDQS endpoints - https://phabricator.wikimedia.org/T354662 (Gehel) p:Triage→High
[14:29:16] Data-Platform-SRE (2024.01.22 - 2024.02.11), CirrusSearch, Discovery-Search (Current work): SUP: Production TODOs - https://phabricator.wikimedia.org/T354595 (Gehel) p:Triage→High
[14:36:39] Data-Engineering (Sprint 7), Data-Platform-SRE (2024.01.22 - 2024.02.11): Users in archiva-deployer group can't upload artifacts anymore. - https://phabricator.wikimedia.org/T355352 (BTullis) In case it helps, there is a long Slack thread about this incident here: https://wikimedia.slack.com/archives/C05...
[14:40:39] Data-Platform-SRE (2024.01.01 - 2024.01.21): Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 (bking) @ayounsi Thanks for the link. We're in the process of rolling out new hosts and unfortunately, we reused the existing puppet code without much thought about public...
[14:43:11] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11): Migrate yarn.wikimedia.org to bullseye - https://phabricator.wikimedia.org/T349399 (brouberol) a:brouberol
[14:43:38] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11), Data Products: Migrate an-web1001 to Debian bullseye (or bookworm) - https://phabricator.wikimedia.org/T349398 (brouberol)
[14:45:22] Data-Platform-SRE (2024.01.22 - 2024.02.11): Upgrade Airflow instances to Bullseye - https://phabricator.wikimedia.org/T335261 (BTullis)
[14:45:58] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11): Migrate hue.wikimedia.org to bullseye - https://phabricator.wikimedia.org/T349400 (BTullis)
[14:48:07] Data-Platform-SRE (2024.01.22 - 2024.02.11): Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (BTullis)
[14:48:09] Data-Platform-SRE (2024.01.22 - 2024.02.11): Bring stat1011 into service - https://phabricator.wikimedia.org/T354526 (BTullis)
[14:48:11] Data-Platform-SRE (2024.01.22 - 2024.02.11): Bring stat1010 into service with GPU from stat1005 - https://phabricator.wikimedia.org/T336040 (BTullis)
[14:48:13] Data-Platform-SRE (2024.01.22 - 2024.02.11): Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (BTullis)
[14:48:15] Data-Platform-SRE (2024.01.22 - 2024.02.11), decommission-hardware: decommission druid1006.eqiad.wmnet - https://phabricator.wikimedia.org/T354743 (BTullis)
[14:48:17] Data-Platform-SRE (2024.01.22 - 2024.02.11), decommission-hardware: decommission druid1005.eqiad.wmnet - https://phabricator.wikimedia.org/T354742 (BTullis)
[14:48:19] Data-Platform-SRE (2024.01.22 - 2024.02.11), decommission-hardware: decommission druid1004.eqiad.wmnet - https://phabricator.wikimedia.org/T354741 (BTullis)
[14:53:08] Data-Platform-SRE (2024.01.22 - 2024.02.11): Draft a kafka upgrade plan for all the WMF clusters - https://phabricator.wikimedia.org/T355550 (brouberol)
[14:54:11] Data-Platform-SRE (2024.01.22 - 2024.02.11): Draft a kafka upgrade plan for all the WMF clusters - https://phabricator.wikimedia.org/T355550 (brouberol) I've started drafting a Kafka upgrade plan, for both the clusters and related clients: https://docs.google.com/document/d/1eHqkgKZitERH3M4NkJPA3qW3qWqF4haIk...
[14:56:29] Data-Platform-SRE (2024.01.22 - 2024.02.11): Draft a kafka upgrade plan for all the WMF clusters - https://phabricator.wikimedia.org/T355550 (brouberol)
[14:56:33] Data-Engineering, Data-Platform-SRE, SRE, observability, and 3 others: Upgrade Kafka to from 1.x to later version - https://phabricator.wikimedia.org/T300102 (brouberol)
[14:57:44] Data-Platform-SRE (2024.01.22 - 2024.02.11): Decom an-master100[1-2] - https://phabricator.wikimedia.org/T353775 (BTullis)
[14:57:55] Data-Platform-SRE (2024.01.22 - 2024.02.11): Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (Gehel) p:Triage→Medium
[14:58:02] Data-Platform-SRE (2024.01.22 - 2024.02.11): Bring stat1011 into service - https://phabricator.wikimedia.org/T354526 (Gehel) p:Triage→High
[15:14:22] (SystemdUnitFailed) firing: (21) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:14:51] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on an-master1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[15:29:01] Data-Engineering, Data Products (Data Products Sprint 07): Data Quality Issue: Wikitext History Job fail / rerun in Airflow - https://phabricator.wikimedia.org/T342911 (mforns) a:mforns→None
[15:45:52] Analytics, Data-Engineering, Pageviews-API: Yearly endpoint for the /pageviews/top API - https://phabricator.wikimedia.org/T154381 (mforns) @VirginiaPoundstonem just pinging you from this task in case you want to prioritize it. There is this use case in the pageviews tool (yearly top article views),...
[16:10:18] Data-Platform-SRE (2024.01.22 - 2024.02.11), Discovery-Search (Current work): Cirrus-streaming-updater test: validate relforge indices are correctly updated - https://phabricator.wikimedia.org/T350186 (Gehel) a:pfischer→EBernhardson
[16:13:16] Data-Platform-SRE, Discovery-Search (Current work): Test backfilling for cirrus-streaming-updater - https://phabricator.wikimedia.org/T350826 (Gehel) a:bking→dcausse
[16:19:06] Data-Platform-SRE: ProbeDown - https://phabricator.wikimedia.org/T355272 (Gehel)
[16:19:34] (DiskSpace) resolved: Disk space an-worker1113:9100:/var/lib/hadoop/data/b 5.988% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-worker1113 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[16:24:38] Data-Platform-SRE (2024.01.01 - 2024.01.21): Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 (taavi) Somewhat related: {T346946}
[16:26:41] Data-Platform-SRE (2024.01.01 - 2024.01.21): Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 (bking) @taavi Indeed, I was thinking of that one too. I'll post an update there.
[16:34:49] Data-Platform-SRE, Data-Services, cloud-services-team: move cloudelastic behind cloudlb - https://phabricator.wikimedia.org/T346946 (bking) @taavi a few questions to clarify scope and amount of work required, since we've already been asked to [[ https://phabricator.wikimedia.org/T351354#9475546 | mov...
[17:10:01] Data-Engineering, Data Products, Observability-Logging, Traffic, Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (VirginiaPoundstone)
[17:44:37] Data-Platform-SRE (2024.01.01 - 2024.01.21): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye executed with errors: - elastic2088 (**FA...
[17:46:42] Data-Platform-SRE (2024.01.01 - 2024.01.21): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye
[17:52:15] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[18:42:39] (PS1) Santiago Faci: [DNM] Update the WikiLambda instrumentation to use core interaction events [schemas/event/secondary] - https://gerrit.wikimedia.org/r/992224 (https://phabricator.wikimedia.org/T350497)
[18:43:08] (CR) CI reject: [V: -1] [DNM] Update the WikiLambda instrumentation to use core interaction events [schemas/event/secondary] - https://gerrit.wikimedia.org/r/992224 (https://phabricator.wikimedia.org/T350497) (owner: Santiago Faci)
[18:49:34] (DiskSpace) firing: Disk space an-worker1113:9100:/var/lib/hadoop/data/b 5.986% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-worker1113 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[19:06:57] Data-Platform-SRE (2024.01.01 - 2024.01.21): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye executed with errors: - elastic2088 (**FA...
[19:15:08] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on an-master1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[19:15:36] (SystemdUnitFailed) firing: (21) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:24:13] Data-Platform-SRE (2024.01.22 - 2024.02.11), Discovery-Search (Current work): Create dashboards for Search SLOs - https://phabricator.wikimedia.org/T338009 (RKemper) Finished the documentation. With the new dashboard up in https://grafana-rw.wikimedia.org/d/xiWr1c5Iz/search-slos?orgId=1, this work is com...
[19:26:44] Data-Platform-SRE, Discovery-Search: Migrate Search SLOs to prometheus based metrics - https://phabricator.wikimedia.org/T355589 (RKemper)
[19:27:49] Data-Platform-SRE, Discovery-Search: Migrate Search SLOs to prometheus based metrics - https://phabricator.wikimedia.org/T355589 (RKemper) Should be moved to Blocked / Waiting. However for now I think I need to leave it in incoming until it's been triaged by the Search Platform team.
[19:28:08] Data-Platform-SRE, Discovery-Search: Migrate Search SLOs to prometheus based metrics - https://phabricator.wikimedia.org/T355589 (RKemper)
[19:49:09] (PS6) Mforns: Add query to load MediaWiki snapshot to Cassandra AQS config table [analytics/refinery] - https://gerrit.wikimedia.org/r/989558 (https://phabricator.wikimedia.org/T352948)
[19:59:24] Data-Platform-SRE: Re-generate webserver-misc-apps.discovery.wmnet cergen certificate - https://phabricator.wikimedia.org/T355593 (RKemper)
[20:02:00] Data-Platform-SRE (2024.01.22 - 2024.02.11), Wikidata, Discovery-Search (Current work): Generate TLS certs for new WDQS endpoints - https://phabricator.wikimedia.org/T354661 (RKemper) These 3 new services have their internal certs working with Envoy. Moving to Done and spun off https://phabricator.wi...
[20:02:25] Data-Platform-SRE: Re-generate webserver-misc-apps.discovery.wmnet cergen certificate - https://phabricator.wikimedia.org/T355593 (RKemper)
[20:04:44] Data-Platform-SRE, Abstract Wikipedia team, DBA, Data-Services, cloud-services-team: Prepare and check storage layer for Wikifunctions.org (new public content wiki) - https://phabricator.wikimedia.org/T289316 (Jdforrester-WMF) Resolved→In progress
[20:05:54] Data-Platform-SRE, Abstract Wikipedia team, DBA, Data-Services, cloud-services-team: Prepare and check storage layer for Wikifunctions.org (new public content wiki) - https://phabricator.wikimedia.org/T289316 (Jdforrester-WMF) In progress→Resolved
[21:52:16] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[22:39:35] (DiskSpace) resolved: Disk space an-worker1113:9100:/var/lib/hadoop/data/b 5.865% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-worker1113 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[22:42:15] (HdfsCapacityRemainingPercent) resolved: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[22:48:15] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[23:00:18] Data-Platform-SRE (2024.01.22 - 2024.02.11), observability: Change data platform-related IRC channels to improve communication - https://phabricator.wikimedia.org/T352783 (bking)
[23:00:49] Data-Platform-SRE (2024.01.22 - 2024.02.11), observability: Change data platform-related IRC channels to improve communication - https://phabricator.wikimedia.org/T352783 (bking)
[23:02:03] Data-Platform-SRE (2024.01.22 - 2024.02.11), observability: Change data platform-related IRC channels to improve communication - https://phabricator.wikimedia.org/T352783 (bking)
[23:08:37] Data-Platform-SRE: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (bking)
[23:09:03] Data-Platform-SRE: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (bking)
[23:15:08] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on an-master1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:15:36] (SystemdUnitFailed) firing: (21) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed