[00:04:42] (SystemdUnitFailed) firing: matomo-archiver.service Failed on matomo1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:04:57] PROBLEM - Check unit status of matomo-archiver on matomo1002 is CRITICAL: CRITICAL: Status of the systemd unit matomo-archiver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:05:09] PROBLEM - Check systemd state on matomo1002 is CRITICAL: CRITICAL - degraded: The following units failed: matomo-archiver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:14:28] (DiskSpace) firing: (3) Disk space druid1009:9100:/srv 2.527% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [00:22:32] Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs1012.eqiad.wmnet with OS bullseye executed with errors: - aqs1012 (**FAIL**) - Removed from Puppet... 
[00:34:42] (SystemdUnitFailed) firing: (2) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:35:27] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:09:42] (SystemdUnitFailed) firing: (3) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:11:39] Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs1012.eqiad.wmnet with OS bullseye [01:22:20] Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs1012.eqiad.wmnet with OS bullseye executed with errors: - aqs1012 (**FAIL**) - Removed from Puppet... 
[02:09:42] (SystemdUnitFailed) firing: (3) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:39:28] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [03:44:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [03:58:56] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.301% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [04:14:29] (DiskSpace) firing: (3) Disk space druid1009:9100:/srv 0.002045% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [06:09:42] (SystemdUnitFailed) firing: (2) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:58:56] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.301% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
- https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [08:00:20] RECOVERY - Check unit status of matomo-archiver on matomo1002 is OK: OK: Status of the systemd unit matomo-archiver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:00:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [08:03:14] Data-Engineering, serviceops, Event-Platform, Patch-For-Review: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (elukey) [08:10:15] (EventgateValidationErrors) firing: ... [08:10:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [08:11:36] PROBLEM - Check unit status of matomo-archiver on matomo1002 is CRITICAL: CRITICAL: Status of the systemd unit matomo-archiver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:14:29] (DiskSpace) firing: (3) Disk space druid1009:9100:/srv 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [08:18:15] good morning [08:18:37] having a look at the druid alerts [08:19:30] * brouberol waves good morning [08:35:27] PROBLEM - Check systemd state on analytics1074 is CRITICAL: CRITICAL - 
degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:39:42] (SystemdUnitFailed) firing: (3) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:45:16] (EventgateValidationErrors) resolved: ... [08:45:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [08:46:54] (PS32) Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) [08:48:21] PROBLEM - Check systemd state on analytics1075 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:49:42] (SystemdUnitFailed) firing: (4) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:52:15] (EventgateValidationErrors) firing: ... 
[08:52:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [08:54:52] Data-Platform-SRE, Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (CodeReviewBot) aqu merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/454 Update the schema registry used for airflow lineage in test [08:58:40] joal: good morning, I have some questions about analytics/gobblin-wmf.git. In CI it is still targeting Java 8, and if that is no longer needed I'd like to update it to use Java 11 [08:59:49] `mvn clean package` also fails to fetch some `eigenbase:eigenbase-properties` because its Maven repo is blocked (as I understand it) [09:02:56] stevemunene: the generated netboot.cfg is now deployed to production, along with the fix for the missing `echo` for `druid1009|druid101[01])` [09:03:13] Seems the newer druid-public servers have only 1.3T available in /srv, versus 2.7T on the older servers. The older size is in line with the allocated maximum cache size of 2.5T here https://github.com/wikimedia/operations-puppet/blob/production/hieradata/role/common/druid/analytics/worker.yaml#L161 so the smaller /srv is causing the space issues we are having on druid1009-11 [09:03:34] nice brouberol, thanks! [09:29:21] Oh, this is a little concerning about these druid servers. I will likely have approved these when they were specified, so the question is why is the disk space smaller? Is it fundamentally smaller storage capacity, or is it something to do with the RAID configuration of /srv? 
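Editorial note on the 09:03:13 message: the sizes quoted there (1.3T and 2.7T for /srv, 2.5T for the cache limit) can be turned into a quick sanity check. The sketch below is only an illustration. The helper name `cache_fits`, the 5% headroom, and reading "T" as TiB are all assumptions, not anything taken from Druid or the puppet repository.

```python
# Hypothetical helper, not part of Druid or operations-puppet.
TIB = 1024 ** 4  # assumption: "T" in the chat means TiB


def cache_fits(fs_capacity_bytes: int, max_cache_bytes: int, headroom: float = 0.05) -> bool:
    """Return True if the configured cache limit fits on the filesystem
    while leaving `headroom` (a fraction of capacity) free."""
    usable = fs_capacity_bytes * (1.0 - headroom)
    return max_cache_bytes <= usable


if __name__ == "__main__":
    max_cache = int(2.5 * TIB)  # cache limit quoted in the chat
    hosts = (("older hosts", 2.7 * TIB), ("newer hosts", 1.3 * TIB))
    for label, capacity in hosts:
        verdict = "fits" if cache_fits(int(capacity), max_cache) else "does NOT fit"
        print(f"{label}: {capacity / TIB:.1f}T /srv, 2.5T cache limit {verdict}")
```

With the figures from the chat, the limit fits on the older hosts and cannot fit on the newer ones, which matches the DiskSpace alerts for druid1009-11.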
[09:33:05] Data-Engineering, Data-Platform-SRE, Foundational Technology Requests: Enable the Marketing Campaigns Reporting plugin for matomo - https://phabricator.wikimedia.org/T319013 (Gehel) p:Triage→Medium [09:33:11] Data-Platform-SRE, sre-alert-triage: Alert triage: overdue alert [critical] The following units failed: wikidatardf-lexemes-dumps.service - https://phabricator.wikimedia.org/T343896 (Gehel) p:Triage→High [09:33:21] RECOVERY - Check systemd state on analytics1074 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:34:50] Data-Platform-SRE: Project future physical host usage for Search Platform-owned services - https://phabricator.wikimedia.org/T350885 (Gehel) p:Triage→Medium [09:34:52] Data-Platform-SRE: Project future physical host usage for Search Platform-owned services - https://phabricator.wikimedia.org/T350885 (Gehel) a:Gehel [09:35:50] Data-Platform-SRE: Check home/HDFS leftovers of ryanmax - https://phabricator.wikimedia.org/T325527 (Gehel) p:Triage→Low [09:35:59] Data-Platform-SRE: Check home/HDFS leftovers of andyrussg - https://phabricator.wikimedia.org/T338234 (Gehel) p:Triage→Low [09:36:25] Data-Platform-SRE: Check home/HDFS leftovers of ntsako - https://phabricator.wikimedia.org/T343189 (BTullis) p:Triage→Low [09:36:27] Data-Platform-SRE: Check home/HDFS leftovers of aranyap - https://phabricator.wikimedia.org/T340945 (BTullis) p:Triage→Low [09:36:37] Data-Platform-SRE: Check home/HDFS leftovers of paramd - https://phabricator.wikimedia.org/T340427 (BTullis) p:Triage→Low [09:37:30] Data-Platform-SRE, Discovery-Search (Current work): Test backfilling for cirrus-streaming-updater - https://phabricator.wikimedia.org/T350826 (Gehel) p:Triage→High [09:38:02] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: Identify/complete post-migration tasks after rdf-streaming-updater 
migrates to flink operator - https://phabricator.wikimedia.org/T350784 (Gehel) p:Triage→Medium [09:39:29] Data-Engineering, Data-Platform-SRE, Data Pipelines, Data-Platform: Figure out a way to automatize deployment of the spark assembly file - https://phabricator.wikimedia.org/T336513 (Gehel) p:Triage→Medium [09:39:42] (SystemdUnitFailed) firing: (4) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:39:47] Data-Platform-SRE: Update spark warehouse configuration to use the same as Hive - https://phabricator.wikimedia.org/T349523 (Gehel) p:Triage→Medium [09:43:51] RECOVERY - Check systemd state on analytics1075 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:44:52] Data-Engineering, Data-Platform-SRE, Event-Platform: Upgrade schema hosts to bullseye - https://phabricator.wikimedia.org/T349286 (Gehel) p:Triage→High [09:45:02] Data-Platform-SRE: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589 (BTullis) p:Triage→High [09:45:04] Data-Engineering, Data-Platform-SRE: Migrate matomo to Debian bullseye (or bookworm) - https://phabricator.wikimedia.org/T349397 (Gehel) p:Triage→High [09:45:06] Data-Platform-SRE: Migrate archiva to Debian bullseye (or bookworm) - https://phabricator.wikimedia.org/T349292 (Gehel) p:Triage→High [09:45:08] Data-Engineering, Data-Platform-SRE, Event-Platform: Upgrade eventlogging VM to bullseye (or bookworm) - https://phabricator.wikimedia.org/T349289 (Gehel) p:Triage→High [09:45:10] Data-Platform-SRE: Decommission analytics10[70-77] - https://phabricator.wikimedia.org/T343763 (BTullis) p:Triage→High [09:45:12] Data-Engineering, 
Data-Platform-SRE: Migrate yarn.wikimedia.org to bullseye - https://phabricator.wikimedia.org/T349399 (Gehel) p:Triage→High [09:45:15] Data-Engineering, Data-Platform-SRE: Migrate hue.wikimedia.org to bullseye - https://phabricator.wikimedia.org/T349400 (BTullis) p:Triage→High [09:45:17] Data-Engineering, Data-Platform-SRE, Data Products: Migrate an-web1001 to Debian bullseye (or bookworm) - https://phabricator.wikimedia.org/T349398 (Gehel) p:Triage→High [09:45:26] Data-Platform-SRE: Upgrade hadoop master to bullseye - https://phabricator.wikimedia.org/T332573 (BTullis) p:Triage→High [09:45:28] Data-Platform-SRE, Discovery-Search: Migrate MjoLniR deploy repo to Gitlab - https://phabricator.wikimedia.org/T350043 (Gehel) p:Triage→Medium [09:45:39] Data-Platform-SRE: Bring stat1010 into service with GPU from stat1005 - https://phabricator.wikimedia.org/T336040 (BTullis) p:Triage→Medium [09:46:15] Data-Platform-SRE: Migrate apifeatureusage hosts to Bullseye or later - https://phabricator.wikimedia.org/T346053 (Gehel) p:Triage→High [09:46:30] Data-Platform-SRE: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045 (Gehel) p:Triage→High [09:46:38] Data-Platform-SRE: Refresh hadoop coordinators an-coord100[1-2] with an-coord[3-4] - https://phabricator.wikimedia.org/T332572 (Gehel) p:Triage→High [09:47:08] Data-Engineering, Data-Engineering-Jupyter, Data-Platform-SRE, Product-Analytics: Remove anaconda-wmf package from the cluster - https://phabricator.wikimedia.org/T337963 (Gehel) p:Triage→Low [09:48:51] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (Gehel) p:High→Medium [09:49:37] Data-Platform-SRE: Bring stat1010 into service with GPU from stat1005 - https://phabricator.wikimedia.org/T336040 
(Gehel) p:Medium→High [09:49:42] (SystemdUnitFailed) firing: (4) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:51:04] Data-Platform-SRE, Data-Services, cloud-services-team, Patch-For-Review: Automate maintain-views replica depooling - https://phabricator.wikimedia.org/T300427 (taavi) a:taavi [09:51:09] Data-Engineering (Sprint 5), Data-Platform-SRE: [Data Platform] Stop and remove oozie services - https://phabricator.wikimedia.org/T341893 (Gehel) p:Medium→High [09:52:52] Data-Platform-SRE: Monitor kafka topics with a replication factor of 1 - https://phabricator.wikimedia.org/T346887 (Gehel) a:brouberol [09:54:29] Data-Engineering, Data-Platform-SRE, Product-Analytics, Wmfdata-Python, Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (Gehel) a:BTullis [09:55:22] Data-Platform-SRE: Check log rotation settings on airflow instances - https://phabricator.wikimedia.org/T339015 (Gehel) a:Stevemunene→None [09:55:28] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [09:56:36] probably the raid config of /srv btullis, considering this is the line brouberol just fixed on the partman recipe [09:58:18] Data-Platform-SRE: Check log rotation settings on airflow instances - https://phabricator.wikimedia.org/T339015 (Gehel) a:Stevemunene [09:59:00] Data-Engineering, Data-Platform-SRE: Airflow scheduler and 
webserver logs should be readable by airflow instance admins - https://phabricator.wikimedia.org/T304615 (Gehel) [10:00:13] stevemunene: o/ [10:00:24] Data-Engineering, Data-Platform-SRE: Airflow scheduler and webserver logs should be readable by airflow instance admins - https://phabricator.wikimedia.org/T304615 (Gehel) @brouberol things might have changed and might already have been implemented since this ticket was created. There is also the open po... [10:00:28] o/ elukey [10:00:28] if you check lsblk -i on druid1009 there are four disks without partitions [10:00:56] and [10:00:57] elukey@druid1009:~$ sudo pvs PV VG Fmt Attr PSize PFree /dev/md0 vg0 lvm2 a-- <1.75t 357.54 [10:01:01] uff horrible paste [10:01:12] so /dev/md0 is a raid10 array with 4 disks [10:01:30] (see cat /proc/mdstat) [10:01:50] so I guess that the new hosts are not using all the disks [10:03:58] mmm but netboot should now have partman/raid10-8dev.cfg, so probably they need to be reimaged? [10:06:49] (if it was already known, sorry, I saw the last msgs and got curious :) [10:06:56] yes a reimage would be best [10:08:31] Analytics-Radar, Data-Engineering-Icebox, WMDE-Analytics-Engineering: wmde-toolkit-analyzer-build.service fails on stat1007 - https://phabricator.wikimedia.org/T278665 (Manuel) [10:41:07] This is a reminder that I'm going to be putting HDFS into safe mode in 20 minutes' time, in order to carry out: T284150 [10:41:08] T284150: Bring an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 [10:43:01] !log temporarily disabled production jobs that write to HDFS [10:43:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:49:43] (SystemdUnitFailed) firing: (2) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:01:33] !log proceeding with the implementation plan here: https://phabricator.wikimedia.org/T284150#9330525 [11:01:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:01:43] !log entering HDFS safe mode [11:01:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:02:13] https://www.irccloud.com/pastebin/Bk9XB7XO/ [11:02:48] !log set an-coord1001 mysql to read_only [11:02:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:03:34] Data-Platform-SRE, Patch-For-Review: Bring an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 (BTullis) `MariaDB [(none)]> SET @@global.read_only=1; Query OK, 0 rows affected (0.000 sec) MariaDB [(none)]> FLUSH TABLES WITH READ LOCK; Query OK, 0 rows affected (0.040 sec) MariaDB... [11:04:56] !log position confirmed, resetting all slaves on an-mariadb1001 for T284150 [11:04:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:04:59] T284150: Bring an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 [11:06:30] !log merged all config file changes replacing an-coord1001 with an-mariadb1001 [11:06:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:22:24] !log exiting safe mode [11:22:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:25:27] Data-Platform-SRE, Patch-For-Review: Bring an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 (BTullis) I have issued the following on an-coord1001: ` MariaDB [(none)]> SHUTDOWN; Query OK, 0 rows affected (0.002 sec) MariaDB [(none)]> ` [11:27:24] PROBLEM - MariaDB Replica IO: analytics-meta-replica on an-coord1002 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master 
repl@an-coord1001.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on an-coord1001.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:27:40] (DruidSegmentsUnavailable) firing: (5) More than 10 segments have been unavailable for mediawiki_history_reduced_2023_06 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable [11:28:41] (DruidSegmentsUnavailable) firing: (12) More than 10 segments have been unavailable for banner_activity_minutely on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable [11:28:46] RECOVERY - MariaDB Replica IO: analytics-meta-replica on an-coord1002 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:30:49] Most things seem to be OK, but superset is complaining about a permissions issue. [11:33:40] (DruidSegmentsUnavailable) firing: (19) More than 10 segments have been unavailable for banner_activity_minutely on the druid_analytics Druid cluster. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable [11:34:12] PROBLEM - analytics-meta MySQL instance on an-coord1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Mysql_Meta [11:34:16] PROBLEM - MariaDB Replica SQL: analytics-meta-replica on an-coord1002 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:35:34] RECOVERY - analytics-meta MySQL instance on an-coord1002 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Mysql_Meta [11:35:38] RECOVERY - MariaDB Replica SQL: analytics-meta-replica on an-coord1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:38:40] (DruidSegmentsUnavailable) firing: (19) More than 10 segments have been unavailable for banner_activity_minutely on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable [11:40:39] Data-Platform-SRE, Patch-For-Review: Bring an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 (BTullis) There was a missing grant in the permissions table for superset. I had to add this: ` GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, REFERENCES, INDEX, ALTER, CREATE TEMPOR... 
[11:42:40] (DruidSegmentsUnavailable) resolved: (5) More than 10 segments have been unavailable for mediawiki_history_reduced_2023_06 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable [11:43:10] Data-Platform-SRE, Patch-For-Review: Bring an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 (BTullis) [11:43:41] (DruidSegmentsUnavailable) resolved: (19) More than 10 segments have been unavailable for banner_activity_minutely on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable [11:49:43] (SystemdUnitFailed) firing: (5) gobblin-webrequest.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:58:56] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.301% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [11:59:43] (SystemdUnitFailed) firing: (5) gobblin-webrequest.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:02:48] Data-Platform-SRE, Patch-For-Review: Bring 
an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 (BTullis) Now monitoring for any stray traffic being sent to the mariadb service on an-coord1001 with the following: ` btullis@an-coord1001:~$ sudo tcpdump -i any dst port 3306 and dst ho... [12:14:29] (DiskSpace) firing: (3) Disk space druid1009:9100:/srv 1.145e-06% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [12:20:13] (DiskSpace) firing: Disk space an-druid1001:9100:/srv 1.056% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-druid1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [12:25:13] (DiskSpace) resolved: Disk space an-druid1001:9100:/srv 5.246% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-druid1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [12:40:59] (PuppetFailure) firing: Puppet has failed on an-tool1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:41:57] Data-Platform-SRE, Patch-For-Review: Bring an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 (BTullis) [12:49:43] (SystemdUnitFailed) firing: (4) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:52:31] (EventgateValidationErrors) firing: ... 
[12:52:31] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [12:55:51] Analytics-Radar, Data-Engineering-Icebox, WMDE-Analytics-Engineering, Wikidata: wmde-toolkit-analyzer-build.service fails on stat1007 - https://phabricator.wikimedia.org/T278665 (Lydia_Pintscher) [13:04:59] (PS1) Joal: Fix unique_devices iceberg insertion job - bis [analytics/refinery] - https://gerrit.wikimedia.org/r/974527 (https://phabricator.wikimedia.org/T350920) [13:16:36] Hi btullis - how has the MySQL change gone? [13:21:44] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:24:43] (SystemdUnitFailed) firing: (4) wmf_auto_restart_prometheus-mysqld-exporter.service Failed on an-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:27:21] (CR) Aqu: [C: +1] Fix unique_devices iceberg insertion job - bis [analytics/refinery] - https://gerrit.wikimedia.org/r/974527 (https://phabricator.wikimedia.org/T350920) (owner: Joal) [13:43:08] (CR) Joal: [V: +2 C: +2] "Merging for deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/974527 (https://phabricator.wikimedia.org/T350920) (owner: Joal) [13:44:22] Data-Engineering, Data Pipelines: Wrong file names for 2 month files in pageview_complete/monthly - 
https://phabricator.wikimedia.org/T335685 (hashar) Open→Resolved a:hashar https://dumps.wikimedia.org/other/pageview_complete/monthly/2023/2023-03/ ` ../ pageviews-202303-automated.bz2... [13:44:38] !log Deploying refinery for unique-devices hotfix [13:44:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:46:43] ! log Rerun airflow edit_hourly after fix deploy [13:48:17] Data-Engineering, Data Pipelines: Finding root cause of a second spike of text requests on Sept 8th - https://phabricator.wikimedia.org/T317396 (hashar) Open→Declined I don't think there is much reason to investigate further. [13:51:21] joal: yes, all systems are now switched over to use an-mariadb1001 instead of an-coord1001. [13:52:08] btullis: you rock :) [13:52:30] btullis: I'm sorry not to have provided any feedback on the plan you wrote - it just looked good :) [13:53:32] interesting btullis - I just got an unknown error message from scap when deploying refinery [13:54:16] btullis: permission-error on host an-launcher1002 [13:55:57] joal: Oh, interesting. Do you want to look at it together? I'm not aware of anything that would have caused it. [13:56:03] sure! [13:56:06] And thanks <3 [13:56:10] batcave [14:00:57] Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs1012.eqiad.wmnet with OS bullseye [14:02:16] heads up, I'm going to reimage an-druid1003 [14:03:51] Data-Platform-SRE: Upgrade the druid-analytics cluster to bullseye - https://phabricator.wikimedia.org/T332604 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin1001 for host an-druid1003.eqiad.wmnet with OS bullseye [14:08:23] as an-druid1003 has a zookeeper server running on it, I checked, and the zk leader is on an-druid1002. 
an-druid1001 has a follower, so we still have quorum [14:10:56] brouberol: ack - You're not backing up `/var/lib/zookeeper` or anything? Just allowing it to rejoin with no state and discover wverything? [14:11:00] everything [14:12:17] (VarnishKafkaDeliveryErrors) firing: (6) varnishkafka has cache_upload errors on cp3074:9132 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishKafkaDeliveryErrors [14:12:21] (VarnishKafkaDeliveryErrors) firing: (5) varnishkafka has cache_text errors on cp3069:9132 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishKafkaDeliveryErrors [14:12:25] (VarnishKafkaDeliveryErrors) firing: (6) varnishkafka has cache_upload errors on cp3074:9132 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishKafkaDeliveryErrors [14:12:29] (VarnishKafkaDeliveryErrors) firing: (5) varnishkafka has cache_text errors on cp3069:9132 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishKafkaDeliveryErrors [14:12:52] (EventgateValidationErrors) resolved: ... [14:12:58] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [14:14:20] btullis: I'm keeping /srv, but if zookeeper had its data in /var/lib, then it'll get formatted away, and fetched from zk peers when the server comes back in the cluster [14:14:41] I doubt that the dataset size is > 2MB [14:15:08] brouberol: ack.
Yeah, I'm pretty sure that's where it's kept. Should be fine. [14:17:06] (VarnishKafkaDeliveryErrors) resolved: (7) varnishkafka has cache_text errors on cp3067:9132 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishKafkaDeliveryErrors [14:17:10] (VarnishKafkaDeliveryErrors) resolved: (7) varnishkafka has cache_text errors on cp3067:9132 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishKafkaDeliveryErrors [14:17:14] (VarnishKafkaDeliveryErrors) resolved: (8) varnishkafka has cache_upload errors on cp3074:9132 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishKafkaDeliveryErrors [14:17:19] (VarnishKafkaDeliveryErrors) resolved: (8) varnishkafka has cache_upload errors on cp3074:9132 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishKafkaDeliveryErrors [14:21:53] 10Data-Engineering, 10MediaWiki-extensions-EventLogging, 10MW-1.42-notes (1.42.0-wmf.5; 2023-11-14): Hard-deprecate mw.eventLog.inSample() - https://phabricator.wikimedia.org/T348776 (10phuedx) >>! In T348776#9327539, @phuedx wrote: > This is Done™. I'm leaving this task open to track monitoring the client-s... [14:24:45] (EventgateValidationErrors) firing: ... 
[14:24:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [14:30:22] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs1012.eqiad.wmnet with OS bullseye executed with errors: - aqs1012 (**FAIL**) - Removed from Puppet... [14:31:15] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs1012.eqiad.wmnet with OS bullseye [14:33:56] !log Deploy refinery onto HDFS (unique-devices hotfix) [14:34:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:35:52] (03CR) 10Xcollazo: "Ah, yes, because the source is not Iceberg! Thanks for the CC!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/974527 (https://phabricator.wikimedia.org/T350920) (owner: 10Joal) [14:36:40] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable [14:42:11] the zk ensemble is back, each server having the full dataset [14:42:28] cool. Nice one. 
[14:43:11] although I'm seeing this in the cookbook logs: [7/15, retrying in 21.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal..check' raised: Not all services are recovered: an-druid1003:Zookeeper Alive Client Connections too high [14:44:32] the cookbook polls until icinga is in optimal state (with a timeout, best-effort) [14:45:04] understood. I'm just curious as to why we'd see too many connections on zk [14:45:34] it's NaN [14:45:39] aah [14:45:40] unknown in icinga status [14:45:52] anything that is not OK is considered as not optimal there [14:46:20] gotcha, thank you. I was seeing 3, 13 and 18 connections on these nodes, so nothing worrying [14:46:40] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable [14:47:14] 10Data-Platform-SRE: Upgrade the druid-analytics cluster to bullseye - https://phabricator.wikimedia.org/T332604 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin1001 for host an-druid1003.eqiad.wmnet with OS bullseye completed: - an-druid1003 (**WARN**) - Downtimed on Icin... [14:48:26] brouberol: would you have a moment to review this please? https://gerrit.wikimedia.org/r/c/operations/puppet/+/974516 <- cleanup on an-coord100[1-2] to tell them to forget about their mariadb instances. [14:48:42] how often is the check run?
maybe has a slow frequency [14:48:49] and takes a bit to detect the new value [14:49:47] volans: unsure, but I'll have a look [14:49:51] btullis: sure thing [14:51:12] !log deployed refine using refinery-job 0.2.26 JsonSchemaConverter from wikimedia-event-utilities - https://phabricator.wikimedia.org/T321854 [14:51:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:56:59] (PuppetFailure) firing: Puppet has failed on dumpsdata1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:59:46] 10Data-Engineering (Sprint 5), 10Event-Platform, 10Patch-For-Review: [Event Platform] change propagation should discard canary events - https://phabricator.wikimedia.org/T351247 (10Ottomata) [15:00:10] 10Data-Platform-SRE, 10Patch-For-Review: Bring an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 (10BTullis) We got an error from puppet running on an-coord1001 because I hadn't changed the location of the oozie database server. ` Error: '/usr/lib/oozie/bin/ooziedb.sh create -run' retu...
[15:00:19] 10Data-Platform-SRE, 10Patch-For-Review: Bring an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 (10BTullis) [15:00:59] (PuppetFailure) firing: (5) Puppet has failed on an-tool1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:06:59] (PuppetFailure) resolved: Puppet has failed on dumpsdata1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:10:59] (PuppetFailure) firing: (5) Puppet has failed on an-tool1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:16:06] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs1012.eqiad.wmnet with OS bullseye executed with errors: - aqs1012 (**FAIL**) - Removed from Puppet... 
[15:16:37] 10Data-Engineering-Radar, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Data Products (Data Product Sprint 04): Non-deterministic unit test "streamInSample() - session sampling resets" - https://phabricator.wikimedia.org/T304379 (10phuedx) p:05Triage→03Medium [15:35:59] (PuppetFailure) firing: (3) Puppet has failed on an-tool1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:36:26] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE: [Data Platform] Stop and remove oozie services - https://phabricator.wikimedia.org/T341893 (10BTullis) a:03BTullis [15:43:37] 10Data-Platform-SRE: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (10Stevemunene) We encountered some challenges with the decommissioning. The newer servers druid10[09-11] due to an issue with the RAID config (which has been resolved) have a smaller /srv partition at 1.3T as opposed to... 
[15:45:59] (PuppetFailure) firing: (3) Puppet has failed on an-tool1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:58:56] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.301% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [16:14:29] (DiskSpace) firing: (3) Disk space druid1009:9100:/srv 1.145e-06% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [16:20:59] (PuppetFailure) firing: (2) Puppet has failed on an-coord1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:21:59] (PuppetZeroResources) firing: Puppet has failed generate resources on an-worker1127:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [16:22:59] (PuppetZeroResources) firing: Puppet has failed generate resources on kafka-jumbo1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [16:25:59] (PuppetFailure) firing: (4) Puppet has failed on an-coord1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:33:03] (PuppetZeroResources) resolved: Puppet has failed generate resources on kafka-jumbo1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [16:35:59] (PuppetFailure) firing: (4) Puppet has failed on an-coord1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:36:49] seems related to a missing file in hiera https://puppetboard.wikimedia.org/report/kafka-jumbo1007.eqiad.wmnet/770c3547fdc4e3ba554ee0178a7bad9b302958ac [16:36:59] (PuppetZeroResources) resolved: Puppet has failed generate resources on an-worker1127:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [16:49:03] PROBLEM - Oozie Server on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.catalina.startup.Bootstrap https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Oozie [16:52:43] 10Data-Engineering-Radar, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Data Products (Data Product Sprint 04): Non-deterministic unit test "streamInSample() - session sampling resets" - https://phabricator.wikimedia.org/T304379 (10mpopov) Dang, I wish there was a way to set a seed in... [16:58:27] ^ That oozie server can be ignored it's me working on T341893 and accidentally tripping the alert. [16:58:27] T341893: [Data Platform] Stop and remove oozie services - https://phabricator.wikimedia.org/T341893 [17:01:56] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans) [17:04:46] (EventgateValidationErrors) resolved: ... 
[17:04:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [17:15:59] (PuppetFailure) firing: (2) Puppet has failed on an-coord1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:22:23] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE, 10Patch-For-Review: [Data Platform] Stop and remove oozie services - https://phabricator.wikimedia.org/T341893 (10BTullis) [17:24:57] 10Data-Engineering: NEW BUG REPORT Some DAG run attempts fail because File *_temporary/0 does not exist. - https://phabricator.wikimedia.org/T347076 (10JAllemandou) Super interesting finding! Tl;DR: No cross-job data issues, but potential failures when running parallel spark jobs onto the same table. I have do... [17:24:58] (SystemdUnitFailed) firing: (4) wmf_auto_restart_prometheus-mysqld-exporter.service Failed on an-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:25:32] 10Data-Engineering (Sprint 5), 10Data-Platform, 10Movement-Insights: Iceberg unique devices table reporting incorrect numbers for 2023-10-01 - https://phabricator.wikimedia.org/T350920 (10JAllemandou) This has been corrected - data should be ok now. Sorry for the inconvenience. 
[17:26:05] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:26:12] 10Data-Engineering, 10Data-Platform-SRE, 10Privacy Engineering, 10SecTeam-Processed: Enable the TagManager plugin for Matomo - https://phabricator.wikimedia.org/T349910 (10sbassett) >>! In T349910#9325309, @sguebo_WMF wrote: > In light of the above, **the privacy risk associated with the two proposed chang... [17:29:43] (SystemdUnitFailed) firing: (4) wmf_auto_restart_prometheus-mysqld-exporter.service Failed on an-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:44:43] (SystemdUnitFailed) firing: (5) wmf_auto_restart_prometheus-mysqld-exporter.service Failed on an-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:44:59] PROBLEM - Check systemd state on kafka-jumbo1011 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:02:06] 10Data-Engineering, 10Observability-Logging, 10Traffic: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (10Ottomata) > moved all traffic to HAProxy ...We did?! Wow. Can you link some other tasks so I can get some context? 
[18:03:22] 10Data-Platform-SRE, 10Elasticsearch, 10Discovery-Search (Current work): Change partitioning scheme for elasticsearch from RAID to JBOD - https://phabricator.wikimedia.org/T231010 (10bking) [18:03:26] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10bking) 05Resolved→03In progress a:05Jclark-ctr→03bking [18:05:43] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10bking) Reopening as cloudelastic1008-1010 don't appear to have reimaged properly, and we may need them for T350826 . [18:11:48] 10Data-Engineering, 10Tech-Docs-Team, 10Goal: Redesign Data Platform docs on Wikitech - https://phabricator.wikimedia.org/T350911 (10TBurmeister) [18:12:30] 10Data-Engineering, 10Tech-Docs-Team, 10Goal: Redesign Data Platform docs on Wikitech - https://phabricator.wikimedia.org/T350911 (10TBurmeister) [18:12:53] 10Data-Platform-SRE, 10Discovery-Search (Current work): Test backfilling for cirrus-streaming-updater - https://phabricator.wikimedia.org/T350826 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1008.wikimedia.org with OS bullseye [18:15:08] 10Data-Engineering, 10Tech-Docs-Team, 10Goal: Redesign Data Platform docs on Wikitech - https://phabricator.wikimedia.org/T350911 (10TBurmeister) [18:15:11] 10Data-Engineering, 10Documentation: User-centric documentation links - https://phabricator.wikimedia.org/T329550 (10TBurmeister) [18:15:14] 10Data-Platform-SRE: Update the GeoIP databases for matomo to use the same as the production pipelines - https://phabricator.wikimedia.org/T351242 (10BTullis) I have ascertained that we already have the MaxMind GeoIP databases installed to matomo1002.
They use a [[https://github.com/wikimedia/operations-puppet/bl... [18:15:17] 10Data-Platform-SRE, 10Discovery-Search (Current work): Test backfilling for cirrus-streaming-updater - https://phabricator.wikimedia.org/T350826 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1009.wikimedia.org with OS bullseye [18:15:30] 10Data-Platform-SRE: Update the GeoIP databases for matomo to use the same as the production pipelines - https://phabricator.wikimedia.org/T351242 (10BTullis) [18:16:30] 10Data-Platform-SRE, 10Discovery-Search (Current work): Test backfilling for cirrus-streaming-updater - https://phabricator.wikimedia.org/T350826 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye [18:18:16] (EventgateValidationErrors) firing: ... [18:18:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [18:41:19] RECOVERY - Check systemd state on kafka-jumbo1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:44:43] (SystemdUnitFailed) firing: (4) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:50:59] (PuppetFailure) resolved: Puppet has failed on an-tool1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:00:45] (03PS1) 10Clare Ming: Add custom schemas for 2 Android article instruments [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/974674 (https://phabricator.wikimedia.org/T351292) [19:01:29] (03CR) 10CI reject: [V: 04-1] Add custom schemas for 2 Android article instruments [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/974674 (https://phabricator.wikimedia.org/T351292) (owner: 10Clare Ming) [19:03:25] (03CR) 10Clare Ming: "recheck" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/974674 (https://phabricator.wikimedia.org/T351292) (owner: 10Clare Ming) [19:07:12] (03PS2) 10Clare Ming: Add custom schemas for 2 Android article instruments [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/974674 (https://phabricator.wikimedia.org/T351292) [19:19:43] RECOVERY - Check systemd state on matomo1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:24:43] (SystemdUnitFailed) firing: (3) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:25:49] 10Data-Platform-SRE: Update the GeoIP databases for matomo to use the same as the production pipelines - https://phabricator.wikimedia.org/T351242 (10BTullis) I believe that these databases are now in use. We can see a real-time map of visits here: {F41508591} We can also see the settings for the GeoIP2 plugin h... 
[19:25:55] 10Data-Platform-SRE: Update the GeoIP databases for matomo to use the same as the production pipelines - https://phabricator.wikimedia.org/T351242 (10BTullis) [19:26:01] RECOVERY - Check unit status of matomo-archiver on matomo1002 is OK: OK: Status of the systemd unit matomo-archiver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:33:16] (EventgateValidationErrors) resolved: ... [19:33:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [19:34:44] 10Data-Platform-SRE, 10Discovery-Search (Current work): Test backfilling for cirrus-streaming-updater - https://phabricator.wikimedia.org/T350826 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1008.wikimedia.org with OS bullseye executed with errors:... [19:36:17] 10Data-Platform-SRE, 10Discovery-Search (Current work): Test backfilling for cirrus-streaming-updater - https://phabricator.wikimedia.org/T350826 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1009.wikimedia.org with OS bullseye executed with errors:... [19:37:30] 10Data-Platform-SRE, 10Discovery-Search (Current work): Test backfilling for cirrus-streaming-updater - https://phabricator.wikimedia.org/T350826 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye executed with errors:... [19:45:46] (EventgateValidationErrors) firing: ... 
[19:45:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [19:58:56] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.301% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [20:00:45] (EventgateValidationErrors) resolved: ... [20:00:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [20:06:46] (EventgateValidationErrors) firing: ... [20:06:47] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [20:14:29] (DiskSpace) firing: (3) Disk space druid1009:9100:/srv 1.145e-06% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [20:31:46] (EventgateValidationErrors) resolved: ... 
[20:31:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [20:41:59] (PuppetFailure) firing: Puppet has failed on an-tool1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:42:22] !log Ran 'DROP TABLE wmf_dumps.wikitext_raw_rc0' and 'DROP TABLE wmf_dumps.wikitext_raw_rc1' to delete older release candidate tables. [20:42:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:43:39] !log Ran 'sudo -u analytics hdfs dfs -rm -r -skipTrash /wmf/data/wmf_dumps/wikitext_raw_rc0' to delete HDFS data of old release candidate table [20:43:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:43:54] 10Data-Platform-SRE, 10Discovery-Search (Current work): Test backfilling for cirrus-streaming-updater - https://phabricator.wikimedia.org/T350826 (10bking) Apologies for the reimage spam, it's from an unrelated operation. [20:44:05] Ran 'sudo -u analytics hdfs dfs -rm -r -skipTrash /user/hive/warehouse/wmf_dumps.db/wikitext_raw_rc1' to delete HDFS data of old release candidate table [20:44:16] !log Ran 'sudo -u analytics hdfs dfs -rm -r -skipTrash /user/hive/warehouse/wmf_dumps.db/wikitext_raw_rc1' to delete HDFS data of old release candidate table [20:44:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:48:50] 10Data-Engineering, 10Observability-Logging, 10Traffic: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (10Fabfur) Sure! 
The main task was https://phabricator.wikimedia.org/T323557 [20:59:08] 10Data-Platform-SRE, 10Elasticsearch, 10Discovery-Search (Current work): Change partitioning scheme for elasticsearch from RAID to JBOD - https://phabricator.wikimedia.org/T231010 (10bking) [20:59:15] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10bking) 05In progress→03Resolved Not sure what happened, but the cloudelastic1008-1010 hosts are up after a reim... [21:06:24] 10Data-Engineering (Sprint 5): [Data Quality] [Needs Grooming] Collect requirements to define prioritized data pipeline and data metrics - https://phabricator.wikimedia.org/T350409 (10Ahoelzl) A list of data incidents is collected here: https://docs.google.com/document/d/1UsjUJqFnMg9zaaGeJuvLIwAZE-WafgrQPTwhCckp... [21:17:35] 10Data-Platform-SRE, 10Discovery-Search (Current work): Create dashboards/alerts for new Cirrus Streaming Updater - https://phabricator.wikimedia.org/T349772 (10bking) [21:35:49] 10Data-Platform-SRE: Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 (10bking) [21:42:24] 10Data-Engineering, 10Tech-Docs-Team, 10Goal: Define dataset documentation strategy - https://phabricator.wikimedia.org/T349103 (10TBurmeister) [22:00:31] 10Data-Engineering (Sprint 5), 10Data-Platform, 10Movement-Insights: Iceberg unique devices table reporting incorrect numbers for 2023-10-01 - https://phabricator.wikimedia.org/T350920 (10Mayakp.wiki) p:05Triage→03High [22:07:31] 10Data-Platform-SRE: Decommission search-loader1001/2001 VMs - https://phabricator.wikimedia.org/T351123 (10bking) 05Open→03Resolved These VMs have been fully deleted/decommissioned. Closing... 
[22:07:34] 10Data-Platform-SRE, 10Discovery-Search (Current work): Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (10bking) [22:11:00] 10Data-Platform-SRE: Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 (10RKemper) [22:15:32] 10Data-Platform-SRE: Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 (10RKemper) [22:34:36] (03CR) 10Clare Ming: [C: 03+2] Add sampling configuration to /analytics/mediawiki/client/metrics_event [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/973729 (https://phabricator.wikimedia.org/T350495) (owner: 10Phuedx) [22:35:09] (03Merged) 10jenkins-bot: Add sampling configuration to /analytics/mediawiki/client/metrics_event [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/973729 (https://phabricator.wikimedia.org/T350495) (owner: 10Phuedx) [22:49:43] (SystemdUnitFailed) firing: (3) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:04:20] 10Data-Platform-SRE, 10Patch-For-Review: Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host cloudelastic1008.wikimedia.org with OS bullseye [23:04:43] (SystemdUnitFailed) firing: (3) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:09:19] 10Data-Engineering: NEW BUG REPORT Some DAG run attempts fail because File *_temporary/0 does not exist. 
- https://phabricator.wikimedia.org/T347076 (10mpopov) Thank you for the great investigation, Joseph! Phew, I'm glad to learn that there's no risk of data contamination. Oh hey, looks like we're not alone ht... [23:58:57] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.3% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace