[00:14:57] <icinga-wm>	 PROBLEM - Check systemd state on an-conf1001 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:15:27] <icinga-wm>	 PROBLEM - Check systemd state on druid1006 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:15:31] <icinga-wm>	 PROBLEM - Check systemd state on an-conf1003 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:01] <icinga-wm>	 PROBLEM - Check systemd state on druid1004 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:21] <icinga-wm>	 PROBLEM - Check systemd state on an-conf1002 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:57] <icinga-wm>	 PROBLEM - Check systemd state on an-druid1001 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:57] <icinga-wm>	 PROBLEM - Check systemd state on druid1005 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:59] <icinga-wm>	 PROBLEM - Check systemd state on an-druid1002 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:17:05] <icinga-wm>	 PROBLEM - Check systemd state on an-druid1003 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:03] <icinga-wm>	 PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:31] <icinga-wm>	 RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:53:02] <wikibugs>	 10Data-Engineering: Check home/HDFS leftovers of dsharpe - https://phabricator.wikimedia.org/T310463 (10sbassett)
[06:52:02] <wikibugs>	 10Data-Engineering, 10MediaViewer, 10MediaWiki-extensions-EventLogging, 10MW-1.39-notes (1.39.0-wmf.18; 2022-06-27): Decommission the MediaViewer and MultimediaViewer* instruments - https://phabricator.wikimedia.org/T310890 (10phuedx)
[07:54:05] <wikibugs>	 10Data-Engineering, 10Release-Engineering-Team: Improve speed of Gitlab CI - https://phabricator.wikimedia.org/T311111 (10Antoine_Quhen)
[08:00:49] <wikibugs>	 10Data-Engineering, 10Release-Engineering-Team: Improve speed of Gitlab CI - https://phabricator.wikimedia.org/T311111 (10Antoine_Quhen)
[08:01:22] <wikibugs>	 10Data-Engineering, 10Release-Engineering-Team: Improve speed of Gitlab CI - https://phabricator.wikimedia.org/T311111 (10Antoine_Quhen)
[08:02:04] <wikibugs>	 10Data-Engineering, 10Release-Engineering-Team: Improve speed of Gitlab CI - https://phabricator.wikimedia.org/T311111 (10TheresNoTime) > tests suit is running in 4s locally, and in 1.5min in CI  how is //that// possible? 😣
[08:31:00] <wikibugs>	 10Data-Engineering, 10GitLab, 10Release-Engineering-Team: Improve speed of Gitlab CI - https://phabricator.wikimedia.org/T311111 (10hashar) + #gitlab since there is surely caching optimization that would need to be added. It looks like building the image takes a while.   Probably related, last week we had Gi...
[08:52:28] <wikibugs>	 10Data-Engineering, 10GitLab, 10Release-Engineering-Team, 10Performance Issue: Improve speed of Gitlab CI - https://phabricator.wikimedia.org/T311111 (10Reedy)
[09:37:01] <wikibugs>	 10Data-Engineering, 10GitLab: Experiencing pipeline failure due to disk-space issues - https://phabricator.wikimedia.org/T310593 (10Jelto) Docker cache is cleaned every 24h on GitLab Runner nodes now. So failing jobs due to full docker volume should happen less frequent.
[09:42:07] <wikibugs>	 10Data-Engineering, 10GitLab: Experiencing pipeline failure due to disk-space issues - https://phabricator.wikimedia.org/T310593 (10hashar) 05Open→03Resolved For the scope of this task, that solves the issue. Additional tasks can be filed to keep the cache longer, potentially share them across runners etc
[10:05:14] <wikibugs>	 10Data-Engineering, 10GitLab: Experiencing pipeline failure due to disk-space issues - https://phabricator.wikimedia.org/T310593 (10Antoine_Quhen) Thanks!  Also for space & speed, we may not using the ci cache properly: * https://docs.gitlab.com/ee/ci/caching/ * https://phabricator.wikimedia.org/T311111
[10:20:28] <wikibugs>	 10Data-Engineering, 10Equity-Landscape: Setup PySpark environment - https://phabricator.wikimedia.org/T310713 (10ntsako) Got PySpark to work with Airflow
[11:06:39] <icinga-wm>	 RECOVERY - Check systemd state on an-conf1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:22:23] <icinga-wm>	 RECOVERY - Check systemd state on an-conf1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:24:29] <icinga-wm>	 RECOVERY - Check systemd state on an-druid1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:24:39] <icinga-wm>	 RECOVERY - Check systemd state on an-druid1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:24:41] <icinga-wm>	 RECOVERY - Check systemd state on an-conf1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:24:45] <icinga-wm>	 RECOVERY - Check systemd state on druid1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:25:11] <icinga-wm>	 RECOVERY - Check systemd state on druid1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:25:23] <icinga-wm>	 RECOVERY - Check systemd state on druid1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:26:49] <icinga-wm>	 RECOVERY - Check systemd state on an-druid1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:33:20] <joal>	 Hi btullis or ottomata - do we have any idea of why an-conf1001 flapped this morning (I assume the druid flapping is a consequence of an-tool)?
[11:34:47] <joal>	 Also, we've not followed up on the alert about "Namenode FSImage Age" - I think we should
[11:58:29] <aqu>	 joal: I think it's the doing of btullis who tried a recipe yesterday.
[11:59:37] <btullis>	 joal: Yes I know about an-conf - it's related to work that slyng.s has been doing on switching crons to systemd timers: https://gerrit.wikimedia.org/r/c/operations/puppet/+/777451
[12:00:37] <btullis>	 Not a real incident and I think that he has since fixed it. X.ioNoX also asked me about it this morning, as it was all zookeeper servers, not just an-conf1001.
[12:00:38] <joal>	 ack btullis - thanks also aqu - It would be great to have a log line on this chan when operations related to our stack are done
[12:02:00] <btullis>	 Agree, but this was someone outside of this team who did the work on zookeepers, only some of which were ours. It would be good to think of a way that we could sync that.
[12:03:15] <btullis>	 And yes, regarding the fsimage check: I did Ack it yesterday, and I think it popped into this channel with a link to the ticket where it was explained:
[12:03:19] <btullis>	 https://usercontent.irccloud-cdn.com/file/qJqIsyOd/image.png
[12:04:18] <joal>	 btullis: a quick email answer to the alert email would also do (if not a log line here :)
[12:04:24] <joal>	 thanks for letting me know btullis 
[12:04:46] <btullis>	 Yes, sorry. Lots of channels of communication to keep up with. :-)
[12:05:17] <joal>	 For sure too many channels - And it's not a complaint btullis, more of a reminder :)
[12:06:42] <joal>	 btullis: Do we know why the FSImage error popped up?
[12:08:48] <joal>	 btullis: I'm asking cause I'm afraid of the namenode outage we had the other day related to FSImages as well
[12:15:08] <wikibugs>	 10Data-Engineering, 10Event-Platform, 10Generated Data Platform: {Shared Event Platform] Produce new mediawii.page-change stream from MediaWiki EventBus - https://phabricator.wikimedia.org/T311129 (10Ottomata)
[12:19:30] <btullis>	 joal: Yes, we know why it occurred. It's related to the new check that I put in as a result of the outage, which measures the age of the dump in `/srv/hadoop/name/current`
[12:19:55] <btullis>	 joal: See the commit message here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/806205
[12:20:43] <btullis>	 This check only runs on an-master1002 because only the standby server writes to that directory.
[12:21:25] <btullis>	 Because of the failure of the cookbook to fail back to an-master1001, we are currently running on an-master1002 as the active and an-master1001 as the standby.
[12:21:56] <joal>	 And therefore the check fails for an-master1002
[12:21:58] <btullis>	 Therefore the files in `/srv/hadoop/name/current` are up to date on an-master1001, but that's not where the check is running.
[12:22:04] <joal>	 I get it :)
[12:22:05] <btullis>	 Yes, totally.
[12:22:14] <joal>	 Thanks a lot for the detailed explanation
[12:22:46] <btullis>	 I couldn't work out how to tie in better business logic to Icinga, to say: only the current active master should alert.
[12:23:05] <joal>	 I wonder if we could have a way to check the correct host based on some function result (like tell me which is master, and I check that one)
[12:23:10] <joal>	 Maybe overly complicated though
[12:26:55] <btullis>	 I will give it some more thought. Ultimately it would be better to have it in prometheus, because adding any new checks to Icinga is a bit legacy, but having spoken to godo.g about it, we decided that the Icinga approach was the pragmatic decision: https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-observability/20220616.txt
[12:27:19] <btullis>	 There is some more discussion about it here: https://phabricator.wikimedia.org/T309649#8008629
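The idea joal raises above (first work out which host should be checked, then check only that one) could be sketched roughly as follows. This is a minimal illustration, not the Icinga check that was deployed: the function names, the threshold, and the way HA states and fsimage ages are passed in as plain dictionaries are all assumptions made for the example. It follows btullis's explanation that only the standby namenode writes to `/srv/hadoop/name/current`, so the standby is the host whose fsimage age is evaluated.

```python
from typing import Dict, Optional

# Hypothetical threshold; the log does not state the real one.
MAX_FSIMAGE_AGE_SECONDS = 2 * 60 * 60


def pick_standby(ha_states: Dict[str, str]) -> Optional[str]:
    """Return the single standby host, or None if that is ambiguous."""
    standbys = [host for host, state in ha_states.items() if state == "standby"]
    return standbys[0] if len(standbys) == 1 else None


def fsimage_check(ha_states: Dict[str, str],
                  fsimage_age_seconds: Dict[str, float],
                  max_age: float = MAX_FSIMAGE_AGE_SECONDS) -> str:
    """Evaluate fsimage age only on the host that is currently standby."""
    host = pick_standby(ha_states)
    if host is None:
        return "UNKNOWN: could not identify exactly one standby namenode"
    age = fsimage_age_seconds.get(host)
    if age is None:
        return f"UNKNOWN: no fsimage age reported for {host}"
    if age > max_age:
        return f"CRITICAL: fsimage on {host} is {age:.0f}s old"
    return f"OK: fsimage on {host} is {age:.0f}s old"


if __name__ == "__main__":
    # Mirrors the situation described in the log: an-master1002 is active,
    # an-master1001 is standby, so only an-master1001 is evaluated.
    states = {"an-master1001": "standby", "an-master1002": "active"}
    ages = {"an-master1001": 600, "an-master1002": 90000}
    print(fsimage_check(states, ages))
```

With this shape, a failover that swaps the two roles would move the check to the other host automatically instead of alerting on the stale directory of the active namenode. How the states would actually be collected is left open here.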
[12:30:56] <wikibugs>	 (03PS1) 10Vivian Rook: Introduce black formatting to quarry [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/807534 (https://phabricator.wikimedia.org/T288976)
[12:32:35] <joal>	 thanks for the context btullis 
[12:35:34] <wikibugs>	 10Quarry, 10Patch-For-Review: Add black formatting to quarry linter - https://phabricator.wikimedia.org/T288976 (10rook) a:03rook
[12:37:39] <btullis>	 joal: Always a pleasure. I will add a note about the FSImage Age check to here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts
[12:52:37] <wikibugs>	 10Data-Engineering, 10GitLab, 10Release-Engineering-Team, 10Performance Issue: Improve speed of Gitlab CI - https://phabricator.wikimedia.org/T311111 (10Ottomata) Also relevant: {T304845} discussion at https://phabricator.wikimedia.org/T304845#8017565
[12:54:54] <wikibugs>	 10Data-Engineering-Radar, 10Growth-Team, 10MediaWiki-extensions-GuidedTour, 10Patch-For-Review: Finish decommissioning the legacy GuidedTour schemas - https://phabricator.wikimedia.org/T303712 (10phuedx) I split out the removal of the EventLoggingSchemas entries in the hopes that it'll ease moving this tas...
[13:01:17] <wikibugs>	 10Data-Engineering-Radar, 10Growth-Team, 10MediaWiki-extensions-GuidedTour, 10Patch-For-Review: Finish decommissioning the legacy GuidedTour schemas - https://phabricator.wikimedia.org/T303712 (10phuedx)
[13:14:35] <wikibugs>	 10Analytics-Radar, 10WMDE-TechWish-Maintenance, 10WMDE-Templates-FocusArea, 10Patch-For-Review, 10WMDE-TechWish (Sprint-2021-02-03): Add missing normalization to CodeMirror Grafana board - https://phabricator.wikimedia.org/T273748 (10lilients_WMDE) a:05lilients_WMDE→03None
[13:50:10] <mforns>	 ottomata: heya :] I was modifying the airflow dev script as per your suggestions of the other day, and thought that maybe the kerberos-run-command should be run outside the script. It is currently required to kinit beforehand, if you're using the script with your own username, so if you use it as analytics-privatedata it makes sense that you ensure your principal is available. I thought that we could use the script like:
[13:50:27] <mforns>	 sudo -u analytics-privatedata bash
[13:50:50] <mforns>	 export HOME=/tmp/my_test_home
[13:51:04] <mforns>	 kerberos-run-command ./run_dev_instance.sh -p 8123 analytics
[13:51:57] <mforns>	 sorry that last command is missing the analytics-privatedata after kerberos-run-command
[13:52:19] <mforns>	 thoughts?
[14:01:12] <wikibugs>	 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform, 10Patch-For-Review: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (10Eevans)
[14:16:44] <wikibugs>	 10Data-Engineering-Radar, 10Growth-Team, 10MediaWiki-extensions-GuidedTour, 10Patch-For-Review: Finish decommissioning the legacy GuidedTour schemas - https://phabricator.wikimedia.org/T303712 (10Ottomata) <3 <3 <3
[14:55:00] <wikibugs>	 (03PS1) 10Ottomata: Add Schema for Enriched MW Streams [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/807565 (https://phabricator.wikimedia.org/T308017)
[14:56:51] <wikibugs>	 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform, 10Patch-For-Review: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (10Eevans)
[15:01:53] <wikibugs>	 (03PS2) 10Ottomata: WIP - Add new mediawiki entity fragments, and use them in new mediawiki page change schema [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/807565 (https://phabricator.wikimedia.org/T308017)
[15:02:43] <wikibugs>	 10Data-Engineering-Kanban, 10Data-Catalog: Document the Pageviews Dataset - https://phabricator.wikimedia.org/T308047 (10odimitrijevic)
[15:04:28] <wikibugs>	 (03PS3) 10Ottomata: WIP - Add new mediawiki entity fragments, and use them in new mediawiki page change schema [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/807565 (https://phabricator.wikimedia.org/T308017)
[15:06:43] <wikibugs>	 10Data-Engineering, 10Equity-Landscape: Get PySpark to with Airflow - https://phabricator.wikimedia.org/T310713 (10ntsako)
[15:08:13] <wikibugs>	 (03CR) 10Ottomata: "This is intended to be used instead of https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/799351.  I made a new change just because " [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/807565 (https://phabricator.wikimedia.org/T308017) (owner: 10Ottomata)
[15:10:07] <wikibugs>	 10Data-Engineering, 10Equity-Landscape: Get PySpark to with Airflow - https://phabricator.wikimedia.org/T310713 (10ntsako) 05Open→03Resolved p:05Triage→03Lowest
[15:10:09] <wikibugs>	 10Data-Engineering, 10Equity-Landscape: Extract + Transformation Raw Data into Input Metrics - https://phabricator.wikimedia.org/T306625 (10ntsako)
[15:13:03] <ottomata>	 mforns:  hi just saw your message
[15:13:11] <mforns>	 heya :]
[15:13:16] <ottomata>	 it does make sense, but if the intention is to make it easy for the user, perhaps all that is a bit much?
[15:13:25] <mforns>	 I see...
[15:13:54] <mforns>	 but I fear that creating the extra bash is required, since conda doesn't work without a $HOME!
[15:14:20] <mforns>	 *conda create
[15:14:36] <ottomata>	 oh, maybe sudo -u analytics-privatedata is needed before the run_dev_instance, but not the kerberos-run-command or export HOME parts?
[15:14:36] <mforns>	 so the 2 first commands, I don't know how to avoid...
[15:14:41] <ottomata>	 hmmm
[15:14:45] <mforns>	 and the 3rd one is the call to the script, soo...
[15:14:47] <ottomata>	 but it would be nice if that was automatic too
[15:14:50] <ottomata>	 hmmm
[15:14:55] <mforns>	 ok
[15:15:23] <mforns>	 you mean let the script export HOME for you?
[15:16:07] <ottomata>	 i think, if i were doing this, the code that makes and runs the dev instance would just be parameterized with all the bits needed.
[15:16:26] <ottomata>	 then i'd have a wrapper that would handle user and env vars and parameterization stuff
[15:21:28] <mforns>	 ok ottomata will try and modify!
[15:33:35] <mforns>	 ottomata: can you please kill a process on stat1008?
[15:33:45] <ottomata>	 sure
[15:34:12] <mforns>	 ottomata: 18281 it's an orphaned airflow dev instance process
[15:34:30] <ottomata>	 one
[15:34:31] <ottomata>	 one
[15:34:32] <ottomata>	 ...
[15:34:33] <ottomata>	 done.
[15:34:46] <mforns>	 thanks!
[16:02:53] <wikibugs>	 (03Abandoned) 10Luke Bowmaker: Add Schema for Enriched MW Streams [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/799351 (https://phabricator.wikimedia.org/T308017) (owner: 10Luke Bowmaker)
[16:04:42] <wikibugs>	 10Data-Engineering-Kanban, 10DSE-Kubernetes-Cluster: Determine IP ranges for dse-k8s cluster - https://phabricator.wikimedia.org/T310169 (10BTullis)
[16:10:04] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: [Airflow] Refactor HDFSArchiveOperator to run in Skein - https://phabricator.wikimedia.org/T310542 (10Snwachukwu)
[16:54:02] <wikibugs>	 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10BTullis) @Cmjohnson yes please, that's perfect.
[17:04:14] <wikibugs>	 10Data-Engineering, 10Product-Analytics, 10wmfdata-python: Support querying a range of hourly data partitions - https://phabricator.wikimedia.org/T294654 (10nettrom_WMF) I needed something like this for T309036, so I wrote a version that does this on a per-day basis. This can then be modified to do hourly qu...
[17:17:10] <wikibugs>	 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10BTullis) @Cmjohnson yes please, let's use hardware RAID for this please. As @RobH suggested in the parent task, let's...  >  use the flex bays as a raid1 for the OS data, and the...
[17:29:07] <wikibugs>	 (03CR) 10Aqu: [V: 03+2 C: 03+2] "- Included in the weekly train" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/806200 (owner: 10Joal)
[17:35:37] <wikibugs>	 (03CR) 10Aqu: [V: 03+2 C: 03+2] "I merge because:" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/807175 (https://phabricator.wikimedia.org/T310576) (owner: 10Joal)
[19:16:04] <wikibugs>	 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS buster
[19:16:54] <wikibugs>	 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10Cmjohnson)
[19:30:54] <aqu>	 !log Deploying analytics/refinery
[19:30:56] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:33:21] <wikibugs>	 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform, 10Patch-For-Review: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (10Eevans)
[19:42:14] <wikibugs>	 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS buster exec...
[19:42:35] <wikibugs>	 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS bullseye
[19:46:24] <mforns>	 ottomata: OK, I think I got it.
[19:46:34] <mforns>	 to use the script as your user:
[19:46:35] <icinga-wm>	 RECOVERY - AQS root url on aqs2003 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[19:46:35] <icinga-wm>	 RECOVERY - AQS root url on aqs2004 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[19:46:51] <mforns>	 ./run_dev_instance.sh -p 8765 analytics
[19:47:01] <mforns>	 to run as analytics-privatedata:
[19:47:23] <mforns>	 (and thus have access to skein and cluster mode)
[19:48:00] <mforns>	 sudo -u analytics-privatedata kerberos-run-command analytics-privatedata ./run_dev_instance.sh -p 8765 -m /tmp/your_home analytics
[19:48:20] <mforns>	 ottomata: I tested this ^^^ and worked for me
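The two invocations mforns describes above differ only in the prefix and the `-m` home argument, which is the kind of parameterization ottomata suggested earlier. A small wrapper could assemble either command line from the same inputs. The sketch below is hypothetical: `build_command` and its parameter names are invented for illustration, and only the script name and the `-p` and `-m` flags are taken from the messages above.

```python
from typing import List, Optional


def build_command(port: int,
                  instance: str,
                  run_as: Optional[str] = None,
                  home: Optional[str] = None) -> List[str]:
    """Assemble the argv for run_dev_instance.sh (hypothetical helper).

    With run_as set, the call is prefixed with sudo and
    kerberos-run-command, as in the analytics-privatedata example.
    """
    argv: List[str] = []
    if run_as is not None:
        argv += ["sudo", "-u", run_as, "kerberos-run-command", run_as]
    argv += ["./run_dev_instance.sh", "-p", str(port)]
    if home is not None:
        # -m carries the home directory, as in the tested command above.
        argv += ["-m", home]
    argv.append(instance)
    return argv


if __name__ == "__main__":
    # As your own user:
    print(" ".join(build_command(8765, "analytics")))
    # As analytics-privatedata:
    print(" ".join(build_command(8765, "analytics",
                                 run_as="analytics-privatedata",
                                 home="/tmp/your_home")))
```

Building the argv as a list rather than a shell string keeps the pieces easy to test and avoids quoting problems if a path contains spaces.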
[19:48:23] <icinga-wm>	 RECOVERY - Check systemd state on aqs2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:51:21] <icinga-wm>	 RECOVERY - Check systemd state on aqs2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:52:27] <icinga-wm>	 RECOVERY - AQS root url on aqs2005 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[19:53:45] <icinga-wm>	 RECOVERY - AQS root url on aqs2006 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[19:54:29] <icinga-wm>	 RECOVERY - AQS root url on aqs2007 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[19:55:45] <icinga-wm>	 RECOVERY - AQS root url on aqs2009 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[19:56:57] <icinga-wm>	 RECOVERY - Check systemd state on aqs2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:58:03] <icinga-wm>	 RECOVERY - AQS root url on aqs2012 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[19:58:05] <icinga-wm>	 RECOVERY - AQS root url on aqs2011 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.088 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[19:58:11] <icinga-wm>	 RECOVERY - Check systemd state on aqs2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:01:45] <wikibugs>	 (03CR) 10Jiyu: [C: 03+1] "Looks good!" [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/807534 (https://phabricator.wikimedia.org/T288976) (owner: 10Vivian Rook)
[20:13:17] <wikibugs>	 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS bullseye ex...
[20:15:29] <wikibugs>	 (03CR) 10Jiyu: [C: 03+1] "I don't understand why this bug happens but looks like it's gonna work" [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/806474 (https://phabricator.wikimedia.org/T306658) (owner: 10Vivian Rook)
[20:18:11] <wikibugs>	 10Data-Engineering: Add xcollazo to analytics-admins - https://phabricator.wikimedia.org/T311176 (10XCollazo-WMF)
[20:20:51] <ottomata>	 mforns:  that looks pretty good!
[20:22:01] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service,refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:22:10] <wikibugs>	 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS buster
[20:26:11] <icinga-wm>	 PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:26:51] <mforns>	 ottomata: ok
[20:27:45] <wikibugs>	 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS buster exec...
[20:28:06] <wikibugs>	 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS bullseye
[20:28:25] <icinga-wm>	 PROBLEM - Check unit status of refine_event on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:29:46] <wikibugs>	 (03CR) 10Vivian Rook: [C: 03+2] Get non-coincidental history entries. [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/806474 (https://phabricator.wikimedia.org/T306658) (owner: 10Vivian Rook)
[20:31:25] <icinga-wm>	 PROBLEM - Check unit status of refine_eventlogging_legacy on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:35:24] <wikibugs>	 (03Merged) 10jenkins-bot: Get non-coincidental history entries. [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/806474 (https://phabricator.wikimedia.org/T306658) (owner: 10Vivian Rook)
[20:38:01] <icinga-wm>	 PROBLEM - Check unit status of refine_eventlogging_analytics on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:39:37] <wikibugs>	 10Quarry, 10Patch-For-Review: Quarry history feature not showing history - https://phabricator.wikimedia.org/T306658 (10rook) 05Open→03Resolved
[20:45:11] <wikibugs>	 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1007.eqiad.wmnet with OS bullseye
[20:49:41] <icinga-wm>	 PROBLEM - Check unit status of refine_netflow on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_netflow https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:50:33] <mforns>	 ottomata: is there a way to know whether a user is a personal user or a user like analytics-privatedata? just by having the username?
[20:55:13] <aqu>	 !log `scap deploy -f analytics/refinery` because of a crash during `git-fat pull`
[20:55:15] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:00:15] <icinga-wm>	 RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:22:17] <wikibugs>	 (03PS2) 10Vivian Rook: Introduce black formatting to quarry [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/807534 (https://phabricator.wikimedia.org/T288976)
[21:25:11] <icinga-wm>	 RECOVERY - Check unit status of refine_event on an-launcher1002 is OK: OK: Status of the systemd unit refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:26:18] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Introduce black formatting to quarry [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/807534 (https://phabricator.wikimedia.org/T288976) (owner: 10Vivian Rook)
[21:28:07] <icinga-wm>	 RECOVERY - Check unit status of refine_eventlogging_legacy on an-launcher1002 is OK: OK: Status of the systemd unit refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:28:48] <wikibugs>	 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS bullseye ex...
[21:34:47] <icinga-wm>	 RECOVERY - Check unit status of refine_eventlogging_analytics on an-launcher1002 is OK: OK: Status of the systemd unit refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:39:57] <wikibugs>	 (03PS3) 10Vivian Rook: Introduce black formatting to quarry [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/807534 (https://phabricator.wikimedia.org/T288976)
[21:45:55] <wikibugs>	 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1007.eqiad.wmnet with OS bullseye ex...
[21:46:29] <icinga-wm>	 RECOVERY - Check unit status of refine_netflow on an-launcher1002 is OK: OK: Status of the systemd unit refine_netflow https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:46:53] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:10:10] <wikibugs>	 (03PS4) 10Vivian Rook: Introduce black formatting to quarry [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/807534 (https://phabricator.wikimedia.org/T288976)
[22:13:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Introduce black formatting to quarry [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/807534 (https://phabricator.wikimedia.org/T288976) (owner: 10Vivian Rook)
[22:15:00] <wikibugs>	 (03PS5) 10Vivian Rook: Introduce black formatting to quarry [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/807534 (https://phabricator.wikimedia.org/T288976)