[00:02:49] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:30:11] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:32:49] <jinxer-wm>	 (SystemdUnitFailed) resolved: monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:32:49] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:33:55] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:45:31] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:47:49] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:27:49] <jinxer-wm>	 (SystemdUnitFailed) firing: monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:28:59] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:23:07] <wikibugs>	 Data-Engineering, DBA, Data-Services: Prepare and check storage layer for tlywiki - https://phabricator.wikimedia.org/T345169 (Marostegui) @btullis can you give this some priority? It's been sitting here for a while and {T349424} is blocked on it.
[06:54:14] <wikibugs>	 Data-Engineering, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 8 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (gmodena) >>! In T349118#9285852, @Jdforrester-WMF wrote: >>>! In T349118#9285601, @gmodena wrote: >> **Data Engineering...
[08:19:59] <jinxer-wm>	 (PuppetConstantChange) firing: Puppet performing a change on every puppet run on druid1009:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[08:27:49] <jinxer-wm>	 (SystemdUnitFailed) firing: monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:30:00] <jinxer-wm>	 (PuppetConstantChange) firing: (3) Puppet performing a change on every puppet run on druid1006:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[08:40:00] <jinxer-wm>	 (PuppetConstantChange) firing: (4) Puppet performing a change on every puppet run on druid1006:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[08:46:42] <wikibugs>	 Data-Engineering, serviceops, Event-Platform: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (elukey) Found a way to generate Flame Graphs:  * nodejs 10 version: {F40463091} * nodejs 18 version: {F40462916}  Procedure:  * Added the following to nodejs `--perf-basic...
[08:55:36] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Observability-Metrics, Data Engineering and Event Platform Team (Sprint 4), Patch-For-Review: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (fgiunchedi) >>! In T343232#9285062, @BTullis wrote: > ...but I don't...
[09:32:49] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:47:49] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:53:13] <jinxer-wm>	 (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.154% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[09:53:58] <btullis>	 ^ uh-oh. We had better have a look at /srv/ on an-web1001.
[09:58:30] <wikibugs>	 Data-Platform-SRE: /srv on an-web1001 is low on disk space - https://phabricator.wikimedia.org/T349889 (BTullis)
[09:58:52] <wikibugs>	 Data-Platform-SRE: /srv on an-web1001 is low on disk space - https://phabricator.wikimedia.org/T349889 (BTullis) p:Triage→High
[09:59:34] <btullis>	 It's continuing to fill up pretty rapidly, so there is only 50 GB left.
[09:59:39] <btullis>	 https://www.irccloud.com/pastebin/ISViBFsD/
[10:03:43] <btullis>	 I put out a message on Slack as well, in #data-engineering-collab : https://wikimedia.slack.com/archives/CSV483812/p1698400981595519 asking if anyone is in the middle of publishing anything.
[10:05:29] <wikibugs>	 Data-Platform-SRE: /srv on an-web1001 is low on disk space - https://phabricator.wikimedia.org/T349889 (BTullis) Ah good, it's levelled out at 97% and does not seem to be increasing. {F40477027,width=50%}
[10:09:03] <icinga-wm>	 PROBLEM - Disk space on an-web1001 is CRITICAL: DISK CRITICAL - free space: /srv 46656 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-web1001&var-datasource=eqiad+prometheus/ops
[10:10:39] <btullis>	 I have submitted a 7 day silence for the alert. It looks like it has levelled out at 97% which is good for now.
[10:29:43] <joal>	 btullis: I have inspected a bit, and there is one dataset that weighs more than 500Gb
[10:29:52] <joal>	 btullis: I guess we should investigate this one
[10:30:08] <joal>	 btullis: /srv/analytics.wikimedia.org/published/datasets/one-off/caption_competition
[10:30:15] <wikibugs>	 Data-Platform-SRE: /srv on an-web1001 is low on disk space - https://phabricator.wikimedia.org/T349889 (BTullis) Not good. It is tracking up again. {F40479749,width=50%} I started a conversation about it on Slack here: https://wikimedia.slack.com/archives/CSV483812/p1698400981595519  I'm running commands lik...
[10:30:55] <btullis>	 OK, let's wait one minute joal, I think I might have found 41 GB that might get freed up in a sec.
[10:31:17] <joal>	 btullis: Yeah, but we're talking small in comparison to those 500Gb :)
[10:32:17] <btullis>	 Oh definitely, we should absolutely look at the 500 GB dataset. I'm only talking about trying to avoid hitting 100% temporarily.
[10:32:28] <joal>	 yeah I hear you
[10:33:34] <btullis>	 These are new source files on stat1007 that I think are causing us trouble.
[10:33:38] <btullis>	 https://www.irccloud.com/pastebin/4uuo0zg5/
[10:34:12] <joal>	 Right, I saw that paragon has a big folder as well (almost 80Gb)
[10:34:33] <btullis>	 We might be ok, because I think that it will get to 49 GB here, then unlink the 41 GB file.
[10:34:37] <btullis>	 https://www.irccloud.com/pastebin/1Kn2U5Id/
[10:34:58] <btullis>	 ...but then it will probably try to rsync the new 33 GB file as well.
[10:35:43] <btullis>	 No, I don't think there's space.
[10:35:46] <btullis>	 https://www.irccloud.com/pastebin/aNKlPV4v/
[10:37:16] <btullis>	 So thinking about it, this will likely fail, delete the 46 GB temporary file, continue to the 33 GB file, then throw an alert and try again soon.
[10:37:51] <btullis>	 Maybe we *should* do something about that 500 GB dataset that you mentioned now.
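The space reasoning in the messages above can be sketched as a small check. This is a minimal illustration, assuming the sync writes each incoming file as a temporary copy in the destination before replacing the old one, so the old file and the temporary copy coexist for a while. The function names and the example sizes are hypothetical and are not taken from the pastebins.

```python
import shutil

GB = 1000 ** 3  # decimal gigabytes, used only for the illustrative numbers


def free_space(path):
    """Return the free bytes on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free


def transfer_fits(free_bytes, incoming_bytes, existing_bytes=0,
                  delete_existing_first=False):
    """Return True if a temporary copy of the incoming file fits.

    Assumption: the old destination file stays on disk until the new copy
    is complete, unless it is deleted beforehand.
    """
    available = free_bytes
    if delete_existing_first:
        # The blocks are only released once every hard link to the old
        # file has been removed.
        available += existing_bytes
    return incoming_bytes <= available
```

With illustrative numbers, `transfer_fits(30 * GB, 33 * GB)` is false, while `transfer_fits(30 * GB, 33 * GB, existing_bytes=41 * GB, delete_existing_first=True)` is true.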
[10:38:44] <joal>	 btullis: I pinged fab on the slack thread, he's the one having generated the data
[10:39:08] <btullis>	 Ah cool. Thanks. I wondered who it was.
[10:40:05] <joal>	 btullis: paragon has 82Gb to be synced from stat1007, and it seems to be done
[10:40:28] <btullis>	 Not according to this:
[10:40:32] <btullis>	 https://www.irccloud.com/pastebin/Z4r9EQyr/
[10:40:43] <joal>	 btullis: hopefully the sync process will finish successfully and we'll be able to drop some big stuff soon?
[10:41:11] <btullis>	 Yeah, there is also this, but it's small fry.
[10:41:15] <btullis>	 https://www.irccloud.com/pastebin/kurZZ8XT/
[10:43:08] <btullis>	 https://www.irccloud.com/pastebin/MdcFApOe/
[10:44:03] <joal>	 yeah - the big ones (more than 50Gb) are paragon with ~90Gb (those files currently being published), the caption_competition one ~500Gb (from fab), santosh with ~60Gb, ladsgroup with 94Gb and diego with 54Gb
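A survey like the one above (folders over 50 GB) could be scripted roughly as follows. This is a hedged sketch, not the commands that were actually run: `dir_size` and `large_subdirs` are hypothetical helpers, sizes are apparent file sizes rather than allocated blocks, and files reachable through several hard links are counted once.

```python
import os


def dir_size(root):
    """Sum apparent file sizes under `root`, counting each inode once."""
    seen = set()
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            key = (st.st_dev, st.st_ino)
            if key in seen:
                # Another hard link to a file that was already counted.
                continue
            seen.add(key)
            total += st.st_size
    return total


def large_subdirs(parent, threshold_bytes):
    """Return (name, size) for direct subdirectories above the threshold,
    largest first."""
    results = []
    with os.scandir(parent) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                size = dir_size(entry.path)
                if size > threshold_bytes:
                    results.append((entry.name, size))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

For example, `large_subdirs("/srv/analytics.wikimedia.org/published/datasets/one-off", 50 * 1000 ** 3)` would list the one-off folders above 50 GB, assuming the caller can read that tree.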
[10:47:07] <btullis>	 I could delete the existing file (two hardlinks) and then it would probably succeed, because it wouldn't have to have a temporary copy whilst rsyncing the new one.
[10:47:12] <btullis>	 https://www.irccloud.com/pastebin/PVuEl4jo/
[10:47:37] <btullis>	 What do you think? Acceptable?
[10:47:56] <joal>	 btullis: we can reclaim some space:
[10:48:05] <joal>	 /srv/analytics.wikimedia.org/published/datasets/one-off/ladsgroup
[10:48:11] <joal>	 That's about 90Gb
[10:48:26] <joal>	 I talked to Amir, he's ok for us deleting it
[10:48:32] <btullis>	 OK, cool.
[10:48:45] <joal>	 I'm gonna check on machines seeing if the original files are still there
[10:50:53] <joal>	 btullis: we should drop what's in stat1007:/srv/published/datasets/one-off/ladsgroup
[10:51:55] <btullis>	 joal: Can do, but that's only 1.8 GB though. It must be coming from multiple stats servers.
[10:51:59] <btullis>	 https://www.irccloud.com/pastebin/XhilVYhg/
[10:52:23] <joal>	 yes I know - they're gonna make their way back there if we don't delete them :)
[10:52:31] <joal>	 btullis: let's delete them please - cleaning stuff :)
[10:52:37] <btullis>	 Yep. Deleting it now.
[10:53:02] <btullis>	 https://www.irccloud.com/pastebin/wQSRsF8f/
[10:53:18] <joal>	 btullis: the big one comes from stat1005
[10:53:27] <joal>	 Same folder to clean on stat1005 please
[10:54:08] <btullis>	 Done.
[10:54:12] <btullis>	 https://www.irccloud.com/pastebin/DWDrFT85/
[10:54:40] <joal>	 ok cool - this will hopefully buy us the time to talk with fab about those 500Gb
[10:54:57] <btullis>	 I'll delete them on an-web1001 and then re-trigger the rsync process. OK by you?
[10:55:00] <joal>	 btullis: We probably should have a policy about folder sizes, and retention
[10:55:22] <joal>	 btullis: deleting the 500Gb files?
[10:56:06] <btullis>	 Nono, sorry for being vague. Deleting the ladsgroup files in the target on an-web1001 as well. So that the rsync has some room to work.
[10:56:16] <joal>	 btullis: yes awesome :)
[10:56:24] <joal>	 btullis: thanks a million
[11:01:40] <btullis>	 https://www.irccloud.com/pastebin/todZsM96/
[11:02:19] <joal>	 we're good for hopefully the weekend :)
[11:02:54] <btullis>	 Yes, I think so. I didn't need to restart the rsync service on an-web1001, it has picked up and seems to be OK.
[11:03:03] <joal>	 awesome
[11:09:49] <icinga-wm>	 RECOVERY - Disk space on an-web1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-web1001&var-datasource=eqiad+prometheus/ops
[12:27:42] <ottomata>	 thanks yall :) <3
[12:27:49] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:28:27] <btullis>	 yw :-)
[12:31:49] <jinxer-wm>	 (SystemdUnitCrashLoop) firing: (3)  crashloop on an-airflow1007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[12:40:15] <jinxer-wm>	 (PuppetConstantChange) firing: (4) Puppet performing a change on every puppet run on druid1006:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[13:13:59] <wikibugs>	 Data-Engineering, serviceops, Event-Platform: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (elukey) Tested event handling of the Lift Wing rules in staging, everything looks good afaics.
[13:30:58] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team, serviceops, Event-Platform: A rolling restart of eventgate-main seems to cause many client failures - https://phabricator.wikimedia.org/T349823 (Ottomata) This will be a problem not just for jobs, but for all events sent by EventBus....
[13:40:05] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team, Event-Platform: Modify mediawiki.revision.visibility-change to include unsuppressed data - https://phabricator.wikimedia.org/T349845 (Ottomata)
[13:42:23] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Observability-Metrics, Data Engineering and Event Platform Team (Sprint 4), Patch-For-Review: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (Antoine_Quhen) Running locally with Docker, I get those a sample: ` a...
[13:44:42] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, SRE, and 2 others: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (Ottomata)
[13:47:37] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team, serviceops, Event-Platform: A rolling restart of eventgate-main seems to cause many client failures - https://phabricator.wikimedia.org/T349823 (JMeybohm) >>! In T349823#9287341, @Ottomata wrote: > Is there some way to remove ingress whi...
[13:53:26] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, SRE, and 4 others: Upgrade Kafka to 2.x or 3.x - https://phabricator.wikimedia.org/T300102 (Ottomata)
[13:53:46] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, SRE, and 4 others: Upgrade Kafka to 2.x or 3.x - https://phabricator.wikimedia.org/T300102 (Ottomata)
[14:10:25] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team, serviceops, Event-Platform: A rolling restart of eventgate-main seems to cause many client failures - https://phabricator.wikimedia.org/T349823 (Ottomata) Ah, great.  Okay so IIUC, we should  - upgrade relevant vendor templates in eventg...
[14:11:54] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team, serviceops, Event-Platform: [Event Platform] Gracefully handle pod termination in eventgate Helm chart - https://phabricator.wikimedia.org/T349823 (Ottomata)
[14:15:35] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team: Enable the TagManager plugin for Matomo - https://phabricator.wikimedia.org/T349910 (BTullis)
[14:17:36] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team, serviceops, Event-Platform: [Event Platform] Gracefully handle pod termination in eventgate Helm chart - https://phabricator.wikimedia.org/T349823 (JMeybohm) >>! In T349823#9287469, @Ottomata wrote: > @JMeybohm does that sound right? Yes...
[14:26:04] <wikibugs>	 Analytics, Data-Engineering, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (Ottomata) > Do any of the topics that power EventStreams today have...
[14:32:13] <wikibugs>	 Analytics, Data-Engineering, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (CodeReviewBot) xcollazo merged https://gitlab.wikimedia.org/repos/da...
[14:36:35] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, Patch-For-Review: Enable the TagManager plugin for Matomo - https://phabricator.wikimedia.org/T349910 (BTullis) The patch is ready to go. I'm just going to try to ascertain if there are any concerns about enabling the plugin by asking.
[14:43:20] <wikibugs>	 Analytics, Data-Engineering, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (CodeReviewBot) xcollazo opened https://gitlab.wikimedia.org/repos/da...
[14:49:56] <wikibugs>	 Data-Platform-SRE: /srv on an-web1001 is low on disk space - https://phabricator.wikimedia.org/T349889 (BTullis) Open→Resolved This incident is now resolved.  We removed some datasets from `/srv/published-datasets/one-off/ladsgroup` which allowed the new and updated datasets to continue syncing. (Tha...
[14:54:59] <wikibugs>	 (CR) Jdlrobson: Adds skin field in mobilewebuiactions (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/968674 (https://phabricator.wikimedia.org/T346106) (owner: Kimberly Sarabia)
[14:59:23] <wikibugs>	 Data-Platform-SRE, Discovery-Search (Current work): Create dashboards/alerts for new Cirrus Streaming Updater - https://phabricator.wikimedia.org/T349772 (bking)
[15:47:17] <wikibugs>	 Data-Platform-SRE, Wikidata, Wikidata-Query-Service, [DEPRECATED] wdwb-tech, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (bking)
[16:01:23] <wikibugs>	 Data-Platform-SRE, Wikidata, Wikidata-Query-Service, [DEPRECATED] wdwb-tech, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (bking)
[16:03:16] <wikibugs>	 Data-Platform-SRE, serviceops-radar, Discovery-Search (Current work), Epic: Determine and control cirrus streaming updater's usage of MWAPI resources - https://phabricator.wikimedia.org/T349848 (bking)
[16:14:07] <wikibugs>	 Analytics, Data-Engineering, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (Chlod) > cc @Chlod for JavaScript wikimedia-streams client.  Thanks...
[16:28:03] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:32:04] <jinxer-wm>	 (SystemdUnitCrashLoop) firing: (3)  crashloop on an-airflow1007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[16:36:55] <wikibugs>	 Data-Engineering, Tech-Docs-Team, Goal: Define dataset documentation strategy - https://phabricator.wikimedia.org/T349103 (TBurmeister) Status update: I'm in the research and information-gathering phase, building my understanding of this space and meeting with subject matter experts to try to narrow...
[16:40:15] <jinxer-wm>	 (PuppetConstantChange) firing: (4) Puppet performing a change on every puppet run on druid1006:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[16:49:31] <wikibugs>	 (CR) Nettrom: [C: +1] "Looks good to me! I'm unsure whether there are additional fields needed, but as I'm unable to predict the future I think we should wait to" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: Phuedx)
[16:49:54] <wikibugs>	 Data-Engineering, Data-Platform-SRE: Write a design document relating to superset on dse-k8s - https://phabricator.wikimedia.org/T349396 (BTullis) I've made a start on this:  https://docs.google.com/document/d/1PT9cRVFtN23GlWfYo-_bTUzVcK12-dSSJcX-SV4rtqs/edit  It's taking longer than I expected to write...
[16:52:53] <wikibugs>	 Data-Engineering, DBA, Data-Services: Prepare and check storage layer for tlywiki - https://phabricator.wikimedia.org/T345169 (BTullis) Open→Resolved a:BTullis Apologies for the delay. This work is done now. Since the reorg I'm now in the #data-platform-sre team, so I didn't see this tick...
[17:07:58] <btullis>	 I'm off next week folks. See you on November 6th.
[17:18:42] <wikibugs>	 Data-Platform-SRE, Discovery-Search (Current work): Create dashboards/alerts for new Cirrus Streaming Updater - https://phabricator.wikimedia.org/T349772 (bking)
[17:52:09] <wikibugs>	 Data-Engineering, DC-Ops, ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (RobH)
[17:52:26] <wikibugs>	 Data-Engineering, DC-Ops, ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (RobH)
[18:17:34] <icinga-wm>	 PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:17:49] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:23:04] <wikibugs>	 Data-Engineering, DC-Ops, ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (RobH)
[18:23:27] <wikibugs>	 Data-Engineering, DC-Ops, ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (RobH)
[18:26:19] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team, serviceops, Event-Platform: [Event Platform] Gracefully handle pod termination in eventgate Helm chart - https://phabricator.wikimedia.org/T349823 (Ottomata) > Regardless of the above, this is still a valid question I'd say.  Indeed!...
[18:30:42] <icinga-wm>	 RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:32:49] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:32:49] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:47:49] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:02:49] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:32:04] <jinxer-wm>	 (SystemdUnitCrashLoop) firing: (3)  crashloop on an-airflow1007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[20:32:49] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:45:00] <jinxer-wm>	 (PuppetConstantChange) firing: (4) Puppet performing a change on every puppet run on druid1006:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[22:10:08] <wikibugs>	 (CR) Jdlrobson: Adds new readme (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/968714 (https://phabricator.wikimedia.org/T349729) (owner: Kimberly Sarabia)
[22:49:58] <icinga-wm>	 PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:52:49] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:01:38] <icinga-wm>	 RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:02:49] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed