[00:48:13] (SystemdUnitFailed) firing: export_smart_data_dump.service Failed on analytics1077:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:50:09] PROBLEM - Check systemd state on analytics1077 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:46:33] RECOVERY - Check systemd state on analytics1077 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:48:13] (SystemdUnitFailed) resolved: export_smart_data_dump.service Failed on analytics1077:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:25:42] (SystemdUnitFailed) firing: monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:27:11] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:38:24] 10Data-Platform-SRE: Restart Elasticsearch services for java 11 updates - https://phabricator.wikimedia.org/T350703 (10MoritzMuehlenhoff) I've also extended this task to cover the restarts for WCSQ and WQDS, I've just rolled out the respective Java 8 security updates. Also, for cloudelastic* I'm still seeing so... 
[07:39:05] 10Data-Platform-SRE: Restart Elasticsearch services for java 11 updates - https://phabricator.wikimedia.org/T350703 (10MoritzMuehlenhoff) [07:39:51] 10Data-Platform-SRE: Restart Elasticsearch services for Java 8/11 updates - https://phabricator.wikimedia.org/T350703 (10MoritzMuehlenhoff) [07:47:15] (EventgateValidationErrors) firing: ... [07:47:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [07:56:13] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.327% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [08:12:21] PROBLEM - Checks that the airflow database for airflow wmde is working properly on an-airflow1007 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-wmde /usr/lib/airflow/bin/airflow db check did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [08:13:13] PROBLEM - Checks that the local airflow scheduler for airflow @wmde is working properly on an-airflow1007 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-wmde /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1007.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [08:13:15] PROBLEM - Check systemd state on an-airflow1007 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus-node-textfile-prometheus-check-certificate-expiry.service,wmf_auto_restart_airflow-scheduler@wmde.service,wmf_auto_restart_airflow-webserver@wmde.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:13:52] (SystemdUnitFailed) firing: (4) prometheus-node-textfile-prometheus-check-certificate-expiry.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:13:53] (SystemdUnitCrashLoop) firing: (3) crashloop on an-airflow1007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [08:24:01] good morning folks - btullis, stevemunene or brouberol, would anyone of you folks be there? [08:25:04] good morning joal [08:25:14] heya stevemunene [08:25:39] stevemunene: could you run a command for us on the stat1007 machine please (see https://phabricator.wikimedia.org/T350252#9307957) [08:25:53] sure np, having a look [08:25:58] 10Data-Platform-SRE, 10Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ecc0e9fa-3af2-4029-86fe-4b42f82adef6) set by stevemunene@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their service... [08:26:02] joal: morning! 
I'm here as well if needed [08:26:14] thanks brouberol :) [08:33:16] responded to the ticket joal [08:33:26] thanks a lot stevemunene :) [08:34:53] stevemunene: sorry to bother - the demand was about cron jobs, not timers :S [08:38:32] oops my bad [08:42:27] seems there's none joal https://www.irccloud.com/pastebin/xrKoe0xg/ [08:49:41] 10Data-Platform-SRE: Restart Elasticsearch services for Java 8/11 updates - https://phabricator.wikimedia.org/T350703 (10dcausse) @MoritzMuehlenhoff is it possible to delay the restart of blazegraph on nodes running a data import: wdqs1022, wdqs1023 and 1024 (T347504)? [08:56:35] 10Data-Platform-SRE: Restart Elasticsearch services for Java 8/11 updates - https://phabricator.wikimedia.org/T350703 (10dcausse) [08:57:52] No worries stevemunene - awesome - can you please paste this in the ticket? [08:58:15] already done [08:58:43] thank you! [09:00:30] 10Data-Platform-SRE: Restart Elasticsearch services for Java 8/11 updates - https://phabricator.wikimedia.org/T350703 (10MoritzMuehlenhoff) >>! In T350703#9318464, @dcausse wrote: > @MoritzMuehlenhoff is it possible to delay the restart of blazegraph on nodes running a data import: wdqs1022, wdqs1023 and 1024... [09:16:49] I'm seeing 500 errors in toast messages on datahub.wikimedia.org. Am I the only one? [09:19:14] actually not just in toasts. 
https://datahub.wikimedia.org/search?filter_platform=urn%3Ali%3AdataPlatform%3Akafka displays a "Something went wrong" error [09:24:30] I'm seeing tracebacks in the datahub-gms-main-55f97c45d4-z7rtt pod logs for datahub-codfw [09:24:43] org.elasticsearch.ElasticsearchStatusException: Elasticsearch exception [type=search_phase_execution_exception, reason=all shards failed] [09:25:05] {"error":{"root_cause":[{"type":"query_shard_exception","reason":"[simple_query_string] analyzer [query_urn_component] not found","index":"datasetindex_v2_1699458847040","index_uuid":"er-HDnriTJGwmYdYMOXMzQ"}],"type":"search_phase_execution_exception","reason":"all shards [09:25:05] failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"datasetindex_v2_1699458847040","node":"50uKScxYREyFVlCeM4NwpQ","reason":{"type":"query_shard_exception","reason":"[simple_query_string] analyzer [query_urn_component] not found","index":"datasetindex_v2_1699458847040","index_uuid":"er-HDnriTJGwmYdYMOXMzQ"}}]},"status":400} [09:25:14] can anyone help with that? [09:25:44] Oh, can we check on the status of datahubsearch100[1-3] please? I'm 3 minutes from my desk [09:26:46] We could try a redeploy, or if that doesn't work we can rebuild the indices completely. [09:28:08] all 3 opensearch_1@datahub.service services are running [09:29:25] http://localhost:9200/_cluster/health reports green status [09:29:32] (on datahub1003) [09:35:04] OK, I'm at my desk now. Let's go for a redeploy, shall we? Do you want to pair/triplet on it? [09:36:20] I've seen a similar error before, but that was before the upgrade to 0.10 and we didn't have the migration scripts running then, nor the command available to rebuild the opensearch indices. [09:38:37] can I let you take that? I'm in my 1/1 w/ gehel [09:38:44] Yep, will do. 
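For reference, the error body pasted at 09:25:05 is easier to read once the root cause is pulled out of the JSON. The following is only a minimal sketch: `summarize_root_causes` is a hypothetical helper name, and the sample string is the payload from the channel with the two IRC fragments joined back into "all shards failed".

```python
import json

# Error body as pasted in the channel; the message was split across two IRC
# lines, so "all shards" / "failed" are rejoined here.
SAMPLE_ERROR = r"""
{"error":{"root_cause":[{"type":"query_shard_exception","reason":"[simple_query_string] analyzer [query_urn_component] not found","index":"datasetindex_v2_1699458847040","index_uuid":"er-HDnriTJGwmYdYMOXMzQ"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"datasetindex_v2_1699458847040","node":"50uKScxYREyFVlCeM4NwpQ","reason":{"type":"query_shard_exception","reason":"[simple_query_string] analyzer [query_urn_component] not found","index":"datasetindex_v2_1699458847040","index_uuid":"er-HDnriTJGwmYdYMOXMzQ"}}]},"status":400}
"""


def summarize_root_causes(body):
    """Return (index, type, reason) for each root cause in a search error body.

    Hypothetical helper for illustration; it only reads the fields that are
    visible in the pasted payload.
    """
    error = json.loads(body).get("error", {})
    return [
        (cause.get("index", "?"), cause.get("type", "?"), cause.get("reason", "?"))
        for cause in error.get("root_cause", [])
    ]


if __name__ == "__main__":
    for index, kind, reason in summarize_root_causes(SAMPLE_ERROR):
        print(f"{index}: {kind}: {reason}")
```

Read this way, the failure is a missing analyzer (`query_urn_component`) on the dataset index rather than an unhealthy cluster, which is consistent with `_cluster/health` reporting green at 09:29:25.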
[09:43:03] !log executed `helmfile -e codfw --state-values-set roll_restart=1 sync` to roll-restart datahub in codfw [09:43:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:48:53] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:49:50] PROBLEM - Check systemd state on an-worker1118 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:53:37] !log executed `helmfile -e eqiad --state-values-set roll_restart=1 sync` to roll-restart datahub in eqiad [09:53:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:55:04] datahub seems to be back! [09:56:14] Great! The redeploy worked then. So for reference, I did this: https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_restart [09:56:42] do we have a sense of why a redeploy fixed the indices? [09:58:12] No, not exactly. There are some Kubernetes Job objects, which are set to run on each deploy. This is the first: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/datahub/templates/datahub-upgrade/datahub-system-update-job.yml [09:58:36] That runs pre-deploy. [09:59:00] Then this is the second one, which runs after the deploy: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/datahub/templates/datahub-upgrade/datahub-nocode-migration-job.yml [10:00:48] I was tailing the logs with e.g. 
`kubectl logs -f datahub-main-system-update-job-6fj7h` and it creates a lot of indices, but it's not clear to me why the index that we were supposed to be using (datasetindex_v2_1699458847040) wasn't found earlier. [10:02:27] This is the CronJob template for an ad-hoc command that can be run at any time to rebuild the indices completely: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/datahub/templates/datahub-upgrade/datahub-restore-indices-job-template.yml#5 [10:03:10] Thanks, that makes sense! [10:03:35] And last question: do we have (want?) anything in place to catch up these ju [10:03:42] *kinds of error before a human does? [10:08:15] Ideally, yes. But it'll probably be part of a wider 'Productionalisating DataHub' effort, since we're only inching out of its MVP stage. I think that the most useful thing we could do in the short term is to migrate it to the DSE cluster instead of wikikube, so that it's not deployed to codfw and is therefore faster. The fact that MariaDB and Opensearch and Kafka and Karapace are all in eqiad, but the active DataHub frontend is [10:08:15] in codfw at the moment is really noticeable. [10:23:32] (03PS22) 10Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) [10:24:06] (03CR) 10CI reject: [V: 04-1] Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [10:24:30] (03CR) 10Cyndywikime: "Done" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [10:34:15] joal: Are you around at the moment, by any chance? [10:35:04] btullis: Hello :) [10:35:54] Hello. 
I'd like to deploy this puppet patch today if possible, enabling the multiple spark shufflers in production: https://gerrit.wikimedia.org/r/c/operations/puppet/+/964008/ [10:36:44] I would just really appreciate having you around to make sure that things are OK. [10:37:01] If you don't think it's a good time, I can defer too. [10:39:10] RECOVERY - Checks that the airflow database for airflow wmde is working properly on an-airflow1007 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-wmde /usr/lib/airflow/bin/airflow db check succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [10:41:01] (03PS23) 10Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) [10:41:29] (03CR) 10CI reject: [V: 04-1] Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [10:42:03] btullis: I'm there until evening, please go ahead :) [10:46:52] RECOVERY - Check systemd state on an-worker1118 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:48:52] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:50:07] 10Data-Platform-SRE: Restart Elasticsearch services for Java 2023-11-08 updates - https://phabricator.wikimedia.org/T350703 (10Aklapper) [11:02:12] RECOVERY - Checks that the local airflow scheduler for airflow @wmde is working properly on an-airflow1007 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-wmde 
/usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1007.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [11:03:32] 10Data-Platform-SRE, 10Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10Stevemunene) [11:11:32] joal: Thanks so much. I will start that now. [11:11:55] 10Data-Platform-SRE, 10Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10Stevemunene) [11:11:59] 10Data-Platform-SRE, 10Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10Stevemunene) We have made progress with the WMDE airflow instance, merged and implemented all the setup PRs that were pending and initialized the DB as below; Check connection ` ste... [11:14:43] !log deploying multiple spark shufflers to production for T344910 [11:14:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:14:46] T344910: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 [11:18:06] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10BTullis) Deploying this change with: `sudo cumin P:hadoop::common run-puppet-agent` I will remove the existing symlink `/usr/lib/had... [11:35:14] 10Data-Platform-SRE: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (10Stevemunene) [11:35:47] 10Data-Platform-SRE: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (10Stevemunene) [11:46:25] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10BTullis) Puppet has now run on all nodes. 
All of the symlinks are present. ` btullis@cumin1001:~$ sudo cumin A:hadoop-worker 'find /u... [11:47:31] !log restarting yarn-nodemanager service on an-worker1100.eqiad.wmnet as a canary for T344910 [11:47:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:47:34] T344910: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 [11:48:53] (EventgateValidationErrors) firing: ... [11:48:53] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [11:49:16] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10BTullis) The startup log for the hadoop-yarn-nodemanager log looks clean. ` btullis@an-worker1100:~$ tail -f /var/log/hadoop-yarn/yar... [11:57:28] 10Data-Platform-SRE: Restart Elasticsearch services for Java 8 / Java 11 security updates - https://phabricator.wikimedia.org/T350703 (10MoritzMuehlenhoff) [11:58:52] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.327% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [12:00:52] joal: Status update, I've restarted the nodemanager on one worker (an-worker1100) - logs look good so far, but I thought I'd wait for a container to run on it successfully before doing a rolling restart of all of the remaining 85 workers. 
Anything else you can think of to test before proceeding? [12:17:09] Looks like I might be waiting for quite a while before a container gets scheduled on this worker. https://usercontent.irccloud-cdn.com/file/WhcBsnqk/image.png [12:20:10] I think it's to do with the node labels. All of these 6 hosts with the GPU labels seem not to have containers most of the time. [12:20:12] https://usercontent.irccloud-cdn.com/file/zIuJqzbV/image.png [12:20:31] Heya btullis - I think you can roll out at a slow pace - we'll see failing apps if [12:20:36] anything goes wrong [12:21:24] OK, great. I think we can also look at removing these node labels, which might give us more capacity. If I remember correctly, we removed GPUs from most, if not all, of these hadoop workers. [12:22:21] ack btullis - removing labels would be fine [12:22:43] I don't think the hosts go unused, but possibly they're used less than others? weird [12:25:00] OK, I'll start restarting the nodemanagers sequentially with about 15 seconds between each. Should be about 20 minutes to roll-restart them all. Sound ok? 
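As a quick sanity check on the "about 20 minutes" estimate: with the 85 remaining workers mentioned at 12:00:52 and a fixed pause between hosts, the sleep time alone can be worked out as below. This is a rough sketch under stated assumptions; `rolling_restart_minutes` and its `restart_seconds` parameter are hypothetical, and the time each nodemanager takes to restart is not known from the log, so it defaults to zero.

```python
def rolling_restart_minutes(hosts, sleep_seconds, restart_seconds=0.0):
    """Estimate wall-clock minutes for a one-at-a-time rolling restart.

    hosts           -- number of hosts restarted sequentially
    sleep_seconds   -- pause between hosts
    restart_seconds -- assumed per-host restart time (hypothetical; unknown here)
    """
    return hosts * (sleep_seconds + restart_seconds) / 60


if __name__ == "__main__":
    # 85 remaining workers, per the status update in the channel.
    for pause in (15, 30, 60):
        minutes = rolling_restart_minutes(85, pause)
        print(f"{pause:>2}s pause: {minutes:.1f} min")
```

The figures are lower bounds, since any per-host restart time adds to them.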
[12:25:26] 10Quarry, 10Data-Services, 10cloud-services-team (FY2023/2024-Q1): Create db user for Quarry with readonly access to public ToolsDB databases - https://phabricator.wikimedia.org/T348407 (10fnegri) [12:28:37] btullis: 15secs feels small, I'd have gone for more (maybe 30secs or even 1min), but you're the master :) [12:29:16] !log Proceeding to roll-restart yarn nodemanagers with `sudo cumin A:hadoop-worker -b 1 -s 30 'systemctl restart hadoop-yarn-nodemanager.service'` for T344910 [12:29:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:29:20] T344910: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 [12:29:25] joal: You're too kind <3 [12:35:59] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on an-mariadb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [13:35:23] stevemunene: I'm about ready to reimage an-druid1005. Are you familiar with druid at all? I'm looking for signals/metrics that would tell me that the server is back and 100% caught up [13:35:34] kind of like kafka's under replicated partitions count [13:35:43] do you know about anything like that? Thanks! [13:35:59] (PuppetConstantChange) resolved: Puppet performing a change on every puppet run on an-mariadb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [13:47:01] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10BTullis) All hadoop nodemanagers have been restarted. ` btullis@cumin1001:~$ sudo cumin -b 1 -s 30 A:hadoop-worker 'systemctl restart... 
[13:47:07] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10BTullis) [13:48:15] The rolling restart of the hadoop nodemanagers has completed. No drama so far. Now we have three spark shuffler versions running in parallel. [13:48:32] woo [13:48:35] congrats! [13:48:43] brouberol: Not too familiar, but mainly the https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Druid#Coordinators_Administration_UI [13:49:44] right, I have it displayed via ssh tunneling. I think t [13:50:07] *that the /unified-console.html#datasources page actually tells me what I want, aka the availability of the segments [13:50:32] meaning that if some segments are unavailable, the host hasn't caught up with the data yet [13:51:23] btullis: congrats on making this happen :) This will prove useful for sure! thanks a million [13:53:14] joal: Thank you <3 It feels good to get to this point. I've got a few cleanup patches to write now, before I forget. [13:54:14] btullis: before you do :D I'm ready to reimage an-druid1005, but I'm looking for pointers from someone more knowledgeable about it than myself. Is there anything that I should be looking at beforehand? [13:56:32] Sure thing. Shall we look together? People like joal know the innards better than I do, but I can show you the workings in case it helps :-) [13:57:17] https://meet.google.com/rxb-bjxn-nip <- this link is colloquially called the batcave. [13:57:47] it really would! I've been looking at the UI, some docs, etc, but it feels like I'm looking at the empty Photoshop UI for now. 
It somehow makes sense, but I'm not _really_ understanding anything [14:01:09] joining as well [14:22:48] o/ inflatador we're in the batcave above --^ [14:25:14] 10Data-Platform-SRE: Restart Search Platform-owned services for Java 8 / Java 11 security updates - https://phabricator.wikimedia.org/T350703 (10bking) [14:35:01] headsup, I'm going to reimage an-druid1005 [14:38:53] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:38:54] PROBLEM - Check systemd state on datahubsearch1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:43:08] RECOVERY - Check systemd state on datahubsearch1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:43:25] !log pooled druid10[09-11] in the druid-public cluster. 
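On the earlier question of how to tell that a reimaged Druid host has caught up: the segment-availability idea from 13:50 can be expressed as a small check. This is only a sketch under assumptions. It assumes the input is a mapping of datasource name to percentage of segments loaded, which is the shape the Druid coordinator's `/druid/coordinator/v1/loadstatus` endpoint is understood to return; the helper name and the datasource names are hypothetical, and nothing here was run against the cluster in this log.

```python
def unavailable_datasources(load_status):
    """Return the datasources whose segments are not yet 100% loaded.

    load_status -- dict of datasource name -> percent of segments loaded
                   (assumed shape of a coordinator load-status response).
    """
    return {name: pct for name, pct in load_status.items() if pct < 100.0}


if __name__ == "__main__":
    # Hypothetical sample values, not taken from the cluster.
    sample = {"datasource_a": 100.0, "datasource_b": 97.5}
    lagging = unavailable_datasources(sample)
    if lagging:
        print("still loading:", lagging)
    else:
        print("all segments available")
```

An empty result would correspond to the console showing every datasource as fully available.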
[14:43:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:43:53] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:44:08] 10Data-Platform-SRE, 10Infrastructure-Foundations, 10Product-Analytics, 10SRE, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10ayounsi) [14:46:34] 10Data-Platform-SRE: Upgrade the druid-analytics cluster to bullseye - https://phabricator.wikimedia.org/T332604 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin1001 for host an-druid1005.eqiad.wmnet with OS bullseye [14:53:53] (SystemdUnitFailed) firing: (3) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:59:43] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] Move Spark JsonSchemaConverter out of analytics/refinery/source and into wikimedia-event-utilities - https://phabricator.wikimedia.org/T321854 (10Ottomata) a:05Ottomata→03Non... [15:03:40] Fix the dumbest typo in the world https://gerrit.wikimedia.org/r/c/operations/puppet/+/973178 /facepalms [15:46:44] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4): [Maintenance] Understand and inventory change-propagation use cases, deployments, and custom business logic - https://phabricator.wikimedia.org/T350156 (10Ottomata) [15:48:53] (EventgateValidationErrors) firing: ... 
[15:48:53] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [15:53:11] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, and 2 others: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10Ottomata) K, sent to wikitech-l. [15:53:53] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:54:56] 10Data-Platform-SRE: Project future physical host usage for Search Platform-owned services - https://phabricator.wikimedia.org/T350885 (10bking) [15:58:53] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.327% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [16:03:59] (PuppetFailure) firing: Puppet has failed on analytics1071:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:08:59] (PuppetFailure) firing: (2) Puppet has failed on an-worker1097:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:09:23] 
(EventgateValidationErrors) resolved: ... [16:09:23] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [16:11:01] 10Data-Engineering, 10serviceops, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform: [Event Platform] eventgate-wikimedia occasionally fails to produce events due schema fetch errors - https://phabricator.wikimedia.org/T350713 (10Ottomata) @joe's suggestion: > my suggestion with envo... [16:13:53] (EventgateValidationErrors) firing: ... [16:13:53] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [16:13:59] (PuppetFailure) firing: (2) Puppet has failed on an-worker1097:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:14:23] (EventgateValidationErrors) resolved: ... 
[16:14:23] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [16:18:59] (PuppetFailure) resolved: (2) Puppet has failed on an-worker1097:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:20:45] (EventgateValidationErrors) firing: ... [16:20:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [16:23:53] (EventgateValidationErrors) resolved: ... 
[16:23:53] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [16:34:23] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:42:23] 10Data-Engineering, 10serviceops, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform: [Event Platform] eventgate-wikimedia occasionally fails to produce events due schema fetch errors - https://phabricator.wikimedia.org/T350713 (10Ottomata) Ah, nope. I spoke too soon: eventstreams: h... [16:43:45] (EventgateValidationErrors) firing: ... 
[16:43:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [16:43:53] (SystemdUnitFailed) firing: (3) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:44:23] (SystemdUnitFailed) firing: (3) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:48:53] (SystemdUnitFailed) firing: (3) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:49:23] (SystemdUnitFailed) firing: (3) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:55:33] 10Data-Engineering, 10serviceops, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform: [Event Platform] eventgate-wikimedia occasionally fails to produce events due schema fetch errors - 
https://phabricator.wikimedia.org/T350713 (Ottomata) Very similar sounding issue: https://github.... [17:03:53] (SystemdUnitFailed) firing: (3) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:07:36] Data-Engineering (Sprint 5), Data Engineering and Event Platform Team (Sprint 4): [Data Quality] Calculate and log post processing metrics for webrequests - https://phabricator.wikimedia.org/T349456 (lbowmaker) [17:08:45] (EventgateValidationErrors) resolved: ... [17:08:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [17:08:53] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:14:56] Data-Engineering, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, Patch-For-Review: [Event Platform] Move Spark JsonSchemaConverter out of analytics/refinery/source and into wikimedia-event-utilities - https://phabricator.wikimedia.org/T321854 (Ottomata) a:Ottomata [17:15:57] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2: install ram upgrades in an-master100[34] - https://phabricator.wikimedia.org/T349879 (VRiley-WMF) @BTullis we have received the ram 
modules. Could you let us know a good time to power the servers down and install them? Let us know, thanks! [17:18:04] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2: install ram upgrades in an-master100[34] - https://phabricator.wikimedia.org/T349879 (BTullis) Hi @VRiley-WMF - Many thanks. These servers are not doing anything at the moment, so they can be powered them down any time you like. Would you like me to p... [17:22:54] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2: install ram upgrades in an-master100[34] - https://phabricator.wikimedia.org/T349879 (VRiley-WMF) @BTullis I'll go ahead and install them and update the ticket once it's done. Just wanted to verify it would be okay to proceed. [17:23:29] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2: install ram upgrades in an-master100[34] - https://phabricator.wikimedia.org/T349879 (BTullis) Awesome. Many thanks. [17:27:48] Data-Platform-SRE: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (Stevemunene) >>! In T336043#9315689, @BTullis wrote: > I would also look at preparing the patch to [[https://wikitech.wikimedia.org/wiki/LVS#Deploy_a_change_to_an_existing_service|remove the three hosts from LVS]] ahe... [17:28:09] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2: install ram upgrades in an-master100[34] - https://phabricator.wikimedia.org/T349879 (JAllemandou) Yay! 
Thanks so much @VRiley-WMF and @BTullis :) [17:33:53] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:47:38] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2: install ram upgrades in an-master100[34] - https://phabricator.wikimedia.org/T349879 (VRiley-WMF) [17:48:01] Data-Platform-SRE: Restart Search Platform-owned services for Java 8 / Java 11 security updates - https://phabricator.wikimedia.org/T350703 (bking) [18:14:15] (EventgateValidationErrors) firing: ... [18:14:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [18:19:23] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:32:45] Data-Engineering, WMDE-Fundraising-Tech, Event-Platform, WMDE-FUN-Funban-2023: Validation Error for eventlogging_WMDEBannerSizeIssue - https://phabricator.wikimedia.org/T344027 (kai.nissen) [18:33:53] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:34:23] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:48:53] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:48:59] Data-Engineering, WMDE-FUN-Team, WMDE-Fundraising-Tech, Event-Platform, WMDE-FUN-Funban-2023: Validation Error for eventlogging_WMDEBannerSizeIssue - https://phabricator.wikimedia.org/T344027 (kai.nissen) Seems to occasionally happen across all banners ([see dashboard](https://logstash.wikime... [18:50:00] Analytics, Data-Engineering, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, and 2 others: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (AntiCompositeNumber) I would suggest waiting an additional week before enabl... 
[19:30:54] (CR) Clare Ming: [C: +1] "this lgtm - will give it a bit to see if anyone else pipes up before merging" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/968714 (https://phabricator.wikimedia.org/T349729) (owner: Kimberly Sarabia) [19:37:08] Data-Engineering, Tech-Docs-Team, Goal: Define dataset documentation strategy - https://phabricator.wikimedia.org/T349103 (TBurmeister) [19:49:23] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:51:50] Data-Engineering, Tech-Docs-Team, Goal: Define dataset documentation strategy - https://phabricator.wikimedia.org/T349103 (TBurmeister) Met with more stakeholders and product owners / directors; determined that the first project to be prioritized in this area is: https://phabricator.wikimedia.org/T35... [19:58:53] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.327% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [20:00:19] Analytics, Data-Engineering, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, and 2 others: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (Quiddity) Re: User-notice - Thank you for the draft-wording, that is greatly... 
[20:00:27] Data-Engineering, serviceops, Data Engineering and Event Platform Team (Sprint 4), Event-Platform: [Event Platform] eventgate-wikimedia occasionally fails to produce events due schema fetch errors - https://phabricator.wikimedia.org/T350713 (Ottomata) Added a retry_policy for 5xx, but still getti... [20:03:53] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:06:31] Analytics, Data-Engineering, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, and 2 others: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (Ottomata) > Pywikibot link from github to mediawiki-wiki +1, thank you. > I... [20:17:06] Analytics, Data-Engineering, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, and 2 others: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (Quiddity) >>! In T266798#9320909, @Ottomata wrote: > What do you mean by 'it... [20:20:00] Analytics, Data-Engineering, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, and 2 others: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (Ottomata) I think it would be good to include it in the earliest edition we... 
[20:38:53] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:48:53] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:53:52] Data-Engineering, serviceops, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, Patch-For-Review: [Event Platform] eventgate-wikimedia occasionally fails to produce events due schema fetch errors - https://phabricator.wikimedia.org/T350713 (Ottomata) Related, I think: {T263... [20:54:50] Data-Platform-SRE: Upgrade the druid-analytics cluster to bullseye - https://phabricator.wikimedia.org/T332604 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin1001 for host an-druid1005.eqiad.wmnet with OS bullseye completed: - an-druid1005 (**WARN**) - Downtimed on Icin... [21:04:16] (EventgateValidationErrors) resolved: ... 
[21:04:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [23:58:53] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.327% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace