[00:02:49] (SystemdUnitFailed) firing: (2) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:30:11] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:32:49] (SystemdUnitFailed) resolved: monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:32:49] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:33:55] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:45:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:47:49] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:27:49] (SystemdUnitFailed) firing: monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:28:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:23:07] Data-Engineering, DBA, Data-Services: Prepare and check storage layer for tlywiki - https://phabricator.wikimedia.org/T345169 (Marostegui) @btullis can you give this some priority? It's been sitting here for a while and {T349424} is blocked on it. [06:54:14] Data-Engineering, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 8 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (gmodena) >>! In T349118#9285852, @Jdforrester-WMF wrote: >>>! In T349118#9285601, @gmodena wrote: >> **Data Engineering... [08:19:59] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on druid1009:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [08:27:49] (SystemdUnitFailed) firing: monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:30:00] (PuppetConstantChange) firing: (3) Puppet performing a change on every puppet run on druid1006:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [08:40:00] (PuppetConstantChange) firing: (4) Puppet performing a change on every puppet run on druid1006:9100 -
https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [08:46:42] Data-Engineering, serviceops, Event-Platform: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (elukey) Found a way to generate Flame Graphs: * nodejs 10 version: {F40463091} * nodejs 18 version: {F40462916} Procedure: * Added the following to nodejs `--perf-basic... [08:55:36] Data-Engineering, Data-Platform-SRE, Observability-Metrics, Data Engineering and Event Platform Team (Sprint 4), Patch-For-Review: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (fgiunchedi) >>! In T343232#9285062, @BTullis wrote: > ...but I don't... [09:32:49] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:47:49] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:53:13] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.154% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [09:53:58] ^ uh-oh. We had better have a look at /srv/ on an-web1001.
[09:58:30] Data-Platform-SRE: /srv on an-web1001 is low on disk space - https://phabricator.wikimedia.org/T349889 (BTullis) [09:58:52] Data-Platform-SRE: /srv on an-web1001 is low on disk space - https://phabricator.wikimedia.org/T349889 (BTullis) p:Triage→High [09:59:34] It's continuing to fill up pretty rapidly, so there is only 50 GB left. [09:59:39] https://www.irccloud.com/pastebin/ISViBFsD/ [10:03:43] I put out a message on Slack as well, in #data-engineering-collab : https://wikimedia.slack.com/archives/CSV483812/p1698400981595519 asking if anyone is in the middle of publishing anything. [10:05:29] Data-Platform-SRE: /srv on an-web1001 is low on disk space - https://phabricator.wikimedia.org/T349889 (BTullis) Ah good, it's levelled out at 97% and does not seem to be increasing. {F40477027,width=50%} [10:09:03] PROBLEM - Disk space on an-web1001 is CRITICAL: DISK CRITICAL - free space: /srv 46656 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-web1001&var-datasource=eqiad+prometheus/ops [10:10:39] I have submitted a 7 day silence for the alert. It looks like it has levelled out at 97% which is good for now. [10:29:43] btullis: I have inspected a bit, and there is one dataset that weighs more than 500 GB [10:29:52] btullis: I guess we should investigate this one [10:30:08] btullis: /srv/analytics.wikimedia.org/published/datasets/one-off/caption_competition [10:30:15] Data-Platform-SRE: /srv on an-web1001 is low on disk space - https://phabricator.wikimedia.org/T349889 (BTullis) Not good. It is tracking up again. {F40479749,width=50%} I started a conversation about it on Slack here: https://wikimedia.slack.com/archives/CSV483812/p1698400981595519 I'm running commands lik... [10:30:55] OK, let's wait one minute joal, I think I might have found 41 GB that might get freed up in a sec.
[10:31:17] btullis: Yeah, but we're talking small in comparison to those 500 GB :) [10:32:17] Oh definitely, we should absolutely look at the 500 GB dataset. I'm only talking about trying to avoid hitting 100% temporarily. [10:32:28] yeah I hear you [10:33:34] These are new source files on stat1007 that I think are causing us trouble. [10:33:38] https://www.irccloud.com/pastebin/4uuo0zg5/ [10:34:12] Right, I saw that paragon has a big folder as well (almost 80 GB) [10:34:33] We might be ok, because I think that it will get to 49 GB here, then unlink the 41 GB file. [10:34:37] https://www.irccloud.com/pastebin/1Kn2U5Id/ [10:34:58] ...but then it will probably try to rsync the new 33 GB file as well. [10:35:43] No, I don't think there's space. [10:35:46] https://www.irccloud.com/pastebin/aNKlPV4v/ [10:37:16] So thinking about it, this will likely fail, delete the 46 GB temporary file, continue to the 33 GB file, then throw an alert and try again soon. [10:37:51] Maybe we *should* do something about that 500 GB dataset that you mentioned now. [10:38:44] btullis: I pinged fab on the Slack thread; he's the one who generated the data [10:39:08] Ah cool. Thanks. I wondered who it was. [10:40:05] btullis: paragon has 82 GB to be synced from stat1007, and it seems to be done [10:40:28] Not according to this: [10:40:32] https://www.irccloud.com/pastebin/Z4r9EQyr/ [10:40:43] btullis: hopefully the sync process will finish successfully and we'll be able to drop some big stuff soon? [10:41:11] Yeah, there is also this, but it's small fry.
[10:41:15] https://www.irccloud.com/pastebin/kurZZ8XT/ [10:43:08] https://www.irccloud.com/pastebin/MdcFApOe/ [10:44:03] yeah - the big ones (more than 50 GB) are paragon with ~90 GB (those files currently being published), the caption_competition one ~500 GB (from fab), santosh with ~60 GB, ladsgroup with 94 GB and diego with 54 GB [10:47:07] I could delete the existing file (two hardlinks) and then it would probably succeed, because it wouldn't have to have a temporary copy whilst rsyncing the new one. [10:47:12] https://www.irccloud.com/pastebin/PVuEl4jo/ [10:47:37] What do you think? Acceptable? [10:47:56] btullis: we can reclaim some space: [10:48:05] /srv/analytics.wikimedia.org/published/datasets/one-off/ladsgroup [10:48:11] That's about 90 GB [10:48:26] I talked to Amir, he's OK with us deleting it [10:48:32] OK, cool. [10:48:45] I'm gonna check on the machines to see if the original files are still there [10:50:53] btullis: we should drop what's in stat1007:/srv/published/datasets/one-off/ladsgroup [10:51:55] joal: Can do, but that's only 1.8 GB though. It must be coming from multiple stats servers. [10:51:59] https://www.irccloud.com/pastebin/XhilVYhg/ [10:52:23] yes I know - they're gonna make their way back there if we don't delete them :) [10:52:31] btullis: let's delete them please - cleaning stuff :) [10:52:37] Yep. Deleting it now. [10:53:02] https://www.irccloud.com/pastebin/wQSRsF8f/ [10:53:18] btullis: the big one comes from stat1005 [10:53:27] Same folder to clean on stat1005 please [10:54:08] Done. [10:54:12] https://www.irccloud.com/pastebin/DWDrFT85/ [10:54:40] ok cool - this will hopefully buy us the time to talk with fab about those 500 GB [10:54:57] I'll delete them on an-web1001 and then re-trigger the rsync process. OK by you? [10:55:00] btullis: We probably should have a policy about folder sizes, and retention [10:55:22] btullis: deleting the 500 GB files? [10:56:06] Nono, sorry for being vague.
Deleting the ladsgroup files in the target on an-web1001 as well. So that the rsync has some room to work. [10:56:16] btullis: yes awesome :) [10:56:24] btullis: thanks a million [11:01:40] https://www.irccloud.com/pastebin/todZsM96/ [11:02:19] we're good for hopefully the weekend :) [11:02:54] Yes, I think so. I didn't need to restart the rsync service on an-web1001, it has picked up and seems to be OK. [11:03:03] awesome [11:09:49] RECOVERY - Disk space on an-web1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-web1001&var-datasource=eqiad+prometheus/ops [12:27:42] thanks y'all :) <3 [12:27:49] (SystemdUnitFailed) firing: (4) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:28:27] yw :-) [12:31:49] (SystemdUnitCrashLoop) firing: (3) crashloop on an-airflow1007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [12:40:15] (PuppetConstantChange) firing: (4) Puppet performing a change on every puppet run on druid1006:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [13:13:59] Data-Engineering, serviceops, Event-Platform: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (elukey) Tested event handling of the Lift Wing rules in staging, everything looks good afaics.
[13:30:58] Data-Engineering, Data Engineering and Event Platform Team, serviceops, Event-Platform: A rolling restart of eventgate-main seems to cause many client failures - https://phabricator.wikimedia.org/T349823 (Ottomata) This will be a problem not just for jobs, but for all events sent by EventBus.... [13:40:05] Data-Engineering, Data Engineering and Event Platform Team, Event-Platform: Modify mediawiki.revision.visibility-change to include unsuppressed data - https://phabricator.wikimedia.org/T349845 (Ottomata) [13:42:23] Data-Engineering, Data-Platform-SRE, Observability-Metrics, Data Engineering and Event Platform Team (Sprint 4), Patch-For-Review: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (Antoine_Quhen) Running locally with Docker, I get those a sample: ` a... [13:44:42] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, SRE, and 2 others: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (Ottomata) [13:47:37] Data-Engineering, Data Engineering and Event Platform Team, serviceops, Event-Platform: A rolling restart of eventgate-main seems to cause many client failures - https://phabricator.wikimedia.org/T349823 (JMeybohm) >>! In T349823#9287341, @Ottomata wrote: > Is there some way to remove ingress whi...
[13:53:26] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, SRE, and 4 others: Upgrade Kafka to 2.x or 3.x - https://phabricator.wikimedia.org/T300102 (Ottomata) [13:53:46] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, SRE, and 4 others: Upgrade Kafka to 2.x or 3.x - https://phabricator.wikimedia.org/T300102 (Ottomata) [14:10:25] Data-Engineering, Data Engineering and Event Platform Team, serviceops, Event-Platform: A rolling restart of eventgate-main seems to cause many client failures - https://phabricator.wikimedia.org/T349823 (Ottomata) Ah, great. Okay so IIUC, we should - upgrade relevant vendor templates in eventg... [14:11:54] Data-Engineering, Data Engineering and Event Platform Team, serviceops, Event-Platform: [Event Platform] Gracefully handle pod termination in eventgate Helm chart - https://phabricator.wikimedia.org/T349823 (Ottomata) [14:15:35] Data-Platform-SRE, Data Engineering and Event Platform Team: Enable the TagManager plugin for Matomo - https://phabricator.wikimedia.org/T349910 (BTullis) [14:17:36] Data-Engineering, Data Engineering and Event Platform Team, serviceops, Event-Platform: [Event Platform] Gracefully handle pod termination in eventgate Helm chart - https://phabricator.wikimedia.org/T349823 (JMeybohm) >>! In T349823#9287469, @Ottomata wrote: > @JMeybohm does that sound right? Yes... [14:26:04] Analytics, Data-Engineering, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (Ottomata) > Do any of the topics that power EventStreams today have...
[14:32:13] Analytics, Data-Engineering, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (CodeReviewBot) xcollazo merged https://gitlab.wikimedia.org/repos/da... [14:36:35] Data-Platform-SRE, Data Engineering and Event Platform Team, Patch-For-Review: Enable the TagManager plugin for Matomo - https://phabricator.wikimedia.org/T349910 (BTullis) The patch is ready to go. I'm just going to try to ascertain if there are any concerns about enabling the plugin by asking. [14:43:20] Analytics, Data-Engineering, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (CodeReviewBot) xcollazo opened https://gitlab.wikimedia.org/repos/da... [14:49:56] Data-Platform-SRE: /srv on an-web1001 is low on disk space - https://phabricator.wikimedia.org/T349889 (BTullis) Open→Resolved This incident is now resolved. We removed some datasets from `/srv/published-datasets/one-off/ladsgroup` which allowed the new and updated datasets to continue syncing. (Tha...
[14:54:59] (CR) Jdlrobson: Adds skin field in mobilewebuiactions (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/968674 (https://phabricator.wikimedia.org/T346106) (owner: Kimberly Sarabia) [14:59:23] Data-Platform-SRE, Discovery-Search (Current work): Create dashboards/alerts for new Cirrus Streaming Updater - https://phabricator.wikimedia.org/T349772 (bking) [15:47:17] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, [DEPRECATED] wdwb-tech, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (bking) [16:01:23] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, [DEPRECATED] wdwb-tech, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (bking) [16:03:16] Data-Platform-SRE, serviceops-radar, Discovery-Search (Current work), Epic: Determine and control cirrus streaming updater's usage of MWAPI resources - https://phabricator.wikimedia.org/T349848 (bking) [16:14:07] Analytics, Data-Engineering, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (Chlod) > cc @Chlod for JavaScript wikimedia-streams client. Thanks...
[16:28:03] (SystemdUnitFailed) firing: (4) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:32:04] (SystemdUnitCrashLoop) firing: (3) crashloop on an-airflow1007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [16:36:55] Data-Engineering, Tech-Docs-Team, Goal: Define dataset documentation strategy - https://phabricator.wikimedia.org/T349103 (TBurmeister) Status update: I'm in the research and information-gathering phase, building my understanding of this space and meeting with subject matter experts to try to narrow... [16:40:15] (PuppetConstantChange) firing: (4) Puppet performing a change on every puppet run on druid1006:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [16:49:31] (CR) Nettrom: [C: +1] "Looks good to me! I'm unsure whether there are additional fields needed, but as I'm unable to predict the future I think we should wait to" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: Phuedx) [16:49:54] Data-Engineering, Data-Platform-SRE: Write a design document relating to superset on dse-k8s - https://phabricator.wikimedia.org/T349396 (BTullis) I've made a start on this: https://docs.google.com/document/d/1PT9cRVFtN23GlWfYo-_bTUzVcK12-dSSJcX-SV4rtqs/edit It's taking longer than I expected to write... [16:52:53] Data-Engineering, DBA, Data-Services: Prepare and check storage layer for tlywiki - https://phabricator.wikimedia.org/T345169 (BTullis) Open→Resolved a:BTullis Apologies for the delay. This work is done now.
Since the reorg I'm now in the #data-platform-sre team, so I didn't see this tick... [17:07:58] I'm off next week folks. See you on November 6th. [17:18:42] Data-Platform-SRE, Discovery-Search (Current work): Create dashboards/alerts for new Cirrus Streaming Updater - https://phabricator.wikimedia.org/T349772 (bking) [17:52:09] Data-Engineering, DC-Ops, ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (RobH) [17:52:26] Data-Engineering, DC-Ops, ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (RobH) [18:17:34] PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:17:49] (SystemdUnitFailed) firing: (5) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:23:04] Data-Engineering, DC-Ops, ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (RobH) [18:23:27] Data-Engineering, DC-Ops, ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (RobH) [18:26:19] Data-Engineering, Data Engineering and Event Platform Team, serviceops, Event-Platform: [Event Platform] Gracefully handle pod termination in eventgate Helm chart - https://phabricator.wikimedia.org/T349823 (Ottomata) > Regardless of the above, this is still a valid question I'd say. Indeed!...
[18:30:42] RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:32:49] (SystemdUnitFailed) firing: (5) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:32:49] (SystemdUnitFailed) firing: (5) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:47:49] (SystemdUnitFailed) firing: (5) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:02:49] (SystemdUnitFailed) firing: (5) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:32:04] (SystemdUnitCrashLoop) firing: (3) crashloop on an-airflow1007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:32:49] (SystemdUnitFailed) firing: (5) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:45:00] (PuppetConstantChange) firing: (4) Puppet 
performing a change on every puppet run on druid1006:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [22:10:08] (CR) Jdlrobson: Adds new readme (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/968714 (https://phabricator.wikimedia.org/T349729) (owner: Kimberly Sarabia) [22:49:58] PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:52:49] (SystemdUnitFailed) firing: (5) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:01:38] RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:02:49] (SystemdUnitFailed) firing: (5) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed