[00:14:57] PROBLEM - Check systemd state on an-conf1001 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:15:27] PROBLEM - Check systemd state on druid1006 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:15:31] PROBLEM - Check systemd state on an-conf1003 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:16:01] PROBLEM - Check systemd state on druid1004 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:16:21] PROBLEM - Check systemd state on an-conf1002 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:16:57] PROBLEM - Check systemd state on an-druid1001 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:16:57] PROBLEM - Check systemd state on druid1005 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:16:59] PROBLEM - Check systemd state on an-druid1002 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:17:05] PROBLEM - Check systemd state on an-druid1003 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:21:03] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following 
units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:30:31] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:53:02] 10Data-Engineering: Check home/HDFS leftovers of dsharpe - https://phabricator.wikimedia.org/T310463 (10sbassett) [06:52:02] 10Data-Engineering, 10MediaViewer, 10MediaWiki-extensions-EventLogging, 10MW-1.39-notes (1.39.0-wmf.18; 2022-06-27): Decommission the MediaViewer and MultimediaViewer* instruments - https://phabricator.wikimedia.org/T310890 (10phuedx) [07:54:05] 10Data-Engineering, 10Release-Engineering-Team: Improve speed of Gitlab CI - https://phabricator.wikimedia.org/T311111 (10Antoine_Quhen) [08:00:49] 10Data-Engineering, 10Release-Engineering-Team: Improve speed of Gitlab CI - https://phabricator.wikimedia.org/T311111 (10Antoine_Quhen) [08:01:22] 10Data-Engineering, 10Release-Engineering-Team: Improve speed of Gitlab CI - https://phabricator.wikimedia.org/T311111 (10Antoine_Quhen) [08:02:04] 10Data-Engineering, 10Release-Engineering-Team: Improve speed of Gitlab CI - https://phabricator.wikimedia.org/T311111 (10TheresNoTime) > tests suit is running in 4s locally, and in 1.5min in CI how is //that// possible? 😣 [08:31:00] 10Data-Engineering, 10GitLab, 10Release-Engineering-Team: Improve speed of Gitlab CI - https://phabricator.wikimedia.org/T311111 (10hashar) + #gitlab since there is surely caching optimization that would need to be added. It looks like building the image takes a while. Probably related, last week we had Gi... 
[08:52:28] 10Data-Engineering, 10GitLab, 10Release-Engineering-Team, 10Performance Issue: Improve speed of Gitlab CI - https://phabricator.wikimedia.org/T311111 (10Reedy) [09:37:01] 10Data-Engineering, 10GitLab: Experiencing pipeline failure due to disk-space issues - https://phabricator.wikimedia.org/T310593 (10Jelto) Docker cache is cleaned every 24h on GitLab Runner nodes now. So failing jobs due to full docker volume should happen less frequent. [09:42:07] 10Data-Engineering, 10GitLab: Experiencing pipeline failure due to disk-space issues - https://phabricator.wikimedia.org/T310593 (10hashar) 05Open→03Resolved For the scope of this task, that solves the issue. Additional tasks can be filed to keep the cache longer, potentially share them across runners etc [10:05:14] 10Data-Engineering, 10GitLab: Experiencing pipeline failure due to disk-space issues - https://phabricator.wikimedia.org/T310593 (10Antoine_Quhen) Thanks! Also for space & speed, we may not using the ci cache properly: * https://docs.gitlab.com/ee/ci/caching/ * https://phabricator.wikimedia.org/T311111 [10:20:28] 10Data-Engineering, 10Equity-Landscape: Setup PySpark environment - https://phabricator.wikimedia.org/T310713 (10ntsako) Got PySpark to work with Airflow [11:06:39] RECOVERY - Check systemd state on an-conf1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:22:23] RECOVERY - Check systemd state on an-conf1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:24:29] RECOVERY - Check systemd state on an-druid1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:24:39] RECOVERY - Check systemd state on an-druid1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:24:41] RECOVERY - Check systemd 
state on an-conf1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:24:45] RECOVERY - Check systemd state on druid1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:25:11] RECOVERY - Check systemd state on druid1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:25:23] RECOVERY - Check systemd state on druid1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:26:49] RECOVERY - Check systemd state on an-druid1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:33:20] Hi btullis or ottomata - do we have any idea of why an-conf1001 flapped this morning (I assume the druid flapping is a consequence of an-tool)? [11:34:47] Also, we've not followed up on the alert about "Namenode FSImage Age" - I think we should [11:58:29] joal: I think it's the doing of btullis who tried a recipe yesterday. [11:59:37] joal: Yes I know about an-conf - it's related to work that slyng.s has been doing on switching crons to systemd timers: https://gerrit.wikimedia.org/r/c/operations/puppet/+/777451 [12:00:37] Not a real incident and I think that he has since fixed it. X.ioNoX also asked me about it this morning, as it was all zookeeper servers, not just an-conf1001. [12:00:38] ack btullis - thanks also aqu - It would be great to have a log line on this chan when operations related to our stack are done [12:02:00] Agree, but this was someone outside of this team who did the work on zookeepers, only some of which were ours. It would be good to think of a way that we could sync that. 
[12:03:15] And yes, I did ack the fsimage check yesterday, and I think it popped into this channel with a link to the ticket where it was explained: [12:03:19] https://usercontent.irccloud-cdn.com/file/qJqIsyOd/image.png [12:04:18] btullis: a quick email answer to the alert email would also do (if not a log line here :) [12:04:24] thanks for letting me know btullis [12:04:46] Yes, sorry. Lots of channels of communication to keep up with. :-) [12:05:17] For sure too many channels - And it's not a complaint btullis, more of a reminder :) [12:06:42] btullis: Do we know why the FSImage error popped up? [12:08:48] btullis: I'm asking cause I'm afraid of the namenode outage we had the other day related to FSImages as well [12:15:08] 10Data-Engineering, 10Event-Platform, 10Generated Data Platform: {Shared Event Platform] Produce new mediawii.page-change stream from MediaWiki EventBus - https://phabricator.wikimedia.org/T311129 (10Ottomata) [12:19:30] joal: Yes, we know why it occurred. It's related to the new check that I put in as a result of the outage, which measures the age of the dump in `/srv/hadoop/name/current` [12:19:55] joal: See the commit message here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/806205 [12:20:43] This check only runs on an-master1002 because only the standby server writes to that directory. [12:21:25] Because of the failure of the cookbook to fail back to an-master1001, we are currently running on an-master1002 as the active and an-master1001 as the standby. [12:21:56] And therefore the check fails for an-master1002 [12:21:58] Therefore the files in `/srv/hadoop/name/current` are up to date on an-master1001, but that's not where the check is running. [12:22:04] I get it :) [12:22:05] Yes, totall. [12:22:07] y [12:22:14] Thanks a lot for the detailed explanation [12:22:46] I couldn't work out how to tie in better business logic to Icinga, to say: only the current active master should alert. 
[12:23:05] I wonder if we could have a way to check the correct host based on some function result (like tell me which is master, and I check that one) [12:23:10] Maybe overly complicated though [12:26:55] I will give it some more thought. Ultimately it would be better to have it in prometheus, because adding any new checks to Icinga is a bit legacy, but having spoken to godo.g about it, we decided that the Icinga approach was the pragmatic decision: https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-observability/20220616.txt [12:27:19] There is some more discussion about it here: https://phabricator.wikimedia.org/T309649#8008629 [12:30:56] (03PS1) 10Vivian Rook: Introduce black formatting to quarry [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/807534 (https://phabricator.wikimedia.org/T288976) [12:32:35] thanks for the context btullis [12:35:34] 10Quarry, 10Patch-For-Review: Add black formatting to quarry linter - https://phabricator.wikimedia.org/T288976 (10rook) a:03rook [12:37:39] joal: Always a pleasure. I will add a note about the FSImage Age check to here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts [12:52:37] 10Data-Engineering, 10GitLab, 10Release-Engineering-Team, 10Performance Issue: Improve speed of Gitlab CI - https://phabricator.wikimedia.org/T311111 (10Ottomata) Also relevant: {T304845} discussion at https://phabricator.wikimedia.org/T304845#8017565 [12:54:54] 10Data-Engineering-Radar, 10Growth-Team, 10MediaWiki-extensions-GuidedTour, 10Patch-For-Review: Finish decommissioning the legacy GuidedTour schemas - https://phabricator.wikimedia.org/T303712 (10phuedx) I split out the removal of the EventLoggingSchemas entries in the hopes that it'll ease moving this tas... 
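The idea discussed above (measure the age of the newest FSImage, but only alert on whichever namenode is currently the standby) can be sketched roughly as follows. This is an illustration under stated assumptions, not the check that was deployed in Puppet: the function names, the `fsimage_*` glob, and the threshold values are hypothetical, and the role string is assumed to be obtained separately (for example from the output of `hdfs haadmin -getServiceState`) and passed in by the caller.

```python
import os
import time
from pathlib import Path
from typing import Optional


def newest_fsimage_age(directory: Path, now: Optional[float] = None) -> Optional[float]:
    """Return the age in seconds of the newest fsimage_* file, or None if none exist.

    Assumption: checkpoint files are named fsimage_<txid>, with .md5 sidecar
    files next to them that should be ignored.
    """
    now = time.time() if now is None else now
    mtimes = [
        p.stat().st_mtime
        for p in directory.glob("fsimage_*")
        if p.is_file() and p.suffix != ".md5"
    ]
    if not mtimes:
        return None
    return now - max(mtimes)


def should_alert(role: str, age: Optional[float], max_age: float) -> bool:
    """Alert only on the standby, since only the standby writes new checkpoints.

    `role` is assumed to be "active" or "standby", determined by the caller.
    """
    if role != "standby":
        return False
    # A standby with no image at all is treated as stale.
    return age is None or age > max_age
```

With something like this, the same check could run on both masters and stay quiet on the active one, which would avoid the false alert seen after the failed failback. Whether that logic belongs in Icinga or in Prometheus is left open, as in the discussion.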
[13:01:17] 10Data-Engineering-Radar, 10Growth-Team, 10MediaWiki-extensions-GuidedTour, 10Patch-For-Review: Finish decommissioning the legacy GuidedTour schemas - https://phabricator.wikimedia.org/T303712 (10phuedx) [13:14:35] 10Analytics-Radar, 10WMDE-TechWish-Maintenance, 10WMDE-Templates-FocusArea, 10Patch-For-Review, 10WMDE-TechWish (Sprint-2021-02-03): Add missing normalization to CodeMirror Grafana board - https://phabricator.wikimedia.org/T273748 (10lilients_WMDE) a:05lilients_WMDE→03None [13:50:10] ottomata: heya :] I was modifying the airflow dev script as per your suggestions of the other day, and thought that maybe the kerberos-run-command should be run outside the script. It is currently required to kinit beforehand, if you're using the script with your own username, so if you use it as analytics-privatedata it makes sense that you ensure your principal is available. I thought that we could use the script like: [13:50:27] sudo -u analytics-privatedata bash [13:50:50] export HOME=/tmp/my_test_home [13:51:04] kerberos-run-command ./run_dev_instance.sh -p 8123 analytics [13:51:57] sorry that last command is missing the analytics-privatedata after kerberos-run-command [13:52:19] thoughts? 
[14:01:12] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform, 10Patch-For-Review: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (10Eevans) [14:16:44] 10Data-Engineering-Radar, 10Growth-Team, 10MediaWiki-extensions-GuidedTour, 10Patch-For-Review: Finish decommissioning the legacy GuidedTour schemas - https://phabricator.wikimedia.org/T303712 (10Ottomata) <3 <3 <3 [14:55:00] (03PS1) 10Ottomata: Add Schema for Enriched MW Streams [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/807565 (https://phabricator.wikimedia.org/T308017) [14:56:51] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform, 10Patch-For-Review: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (10Eevans) [15:01:53] (03PS2) 10Ottomata: WIP - Add new mediawiki entity fragments, and use them in new mediawiki page change schema [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/807565 (https://phabricator.wikimedia.org/T308017) [15:02:43] 10Data-Engineering-Kanban, 10Data-Catalog: Document the Pageviews Dataset - https://phabricator.wikimedia.org/T308047 (10odimitrijevic) [15:04:28] (03PS3) 10Ottomata: WIP - Add new mediawiki entity fragments, and use them in new mediawiki page change schema [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/807565 (https://phabricator.wikimedia.org/T308017) [15:06:43] 10Data-Engineering, 10Equity-Landscape: Get PySpark to with Airflow - https://phabricator.wikimedia.org/T310713 (10ntsako) [15:08:13] (03CR) 10Ottomata: "This is intended to be used instead of https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/799351. 
I made a new change just because " [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/807565 (https://phabricator.wikimedia.org/T308017) (owner: 10Ottomata) [15:10:07] 10Data-Engineering, 10Equity-Landscape: Get PySpark to with Airflow - https://phabricator.wikimedia.org/T310713 (10ntsako) 05Open→03Resolved p:05Triage→03Lowest [15:10:09] 10Data-Engineering, 10Equity-Landscape: Extract + Transformation Raw Data into Input Metrics - https://phabricator.wikimedia.org/T306625 (10ntsako) [15:13:03] mforns: hi just saw your message [15:13:11] heya :] [15:13:16] it does make sense, but if the intention is to make it easy for the user, perhaps all that is a bit much? [15:13:25] I see... [15:13:54] but I fear that creating the extra bash is required, since conda doesn't work without a $HOME! [15:14:20] *conda create [15:14:36] oh, maybe sudo -u analytics-privatedata is needed before the run_dev_instance, but not the kerberos-run-command or export HOME parts? [15:14:36] so the 2 first commands, I don't know how to avoid... [15:14:41] hmmm [15:14:45] and the 3rd one is the call to the script, soo... [15:14:47] but it would be nice if that was automatic too [15:14:50] hmmm [15:14:55] ok [15:15:23] you mean let the script export HOME for you? [15:16:07] i think, if i were doing this, the code that makes and runs the dev instance would just be parameterized with all the bits needed. [15:16:26] then i'd have a wrapper that would handle user and env vars and parameterization stuff [15:21:28] ok ottomata will try and modify! [15:33:35] ottomata: can you please kill a process on stat1008? [15:33:45] sure [15:34:12] ottomata: 18281 it's an orphaned airflow dev instance process [15:34:30] one [15:34:31] one [15:34:32] ... [15:34:33] done. [15:34:46] thanks! 
[16:02:53] (03Abandoned) 10Luke Bowmaker: Add Schema for Enriched MW Streams [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/799351 (https://phabricator.wikimedia.org/T308017) (owner: 10Luke Bowmaker) [16:04:42] 10Data-Engineering-Kanban, 10DSE-Kubernetes-Cluster: Determine IP ranges for dse-k8s cluster - https://phabricator.wikimedia.org/T310169 (10BTullis) [16:10:04] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: [Airflow] Refactor HDFSArchiveOperator to run in Skein - https://phabricator.wikimedia.org/T310542 (10Snwachukwu) [16:54:02] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10BTullis) @Cmjohnson yes please, that's perfect. [17:04:14] 10Data-Engineering, 10Product-Analytics, 10wmfdata-python: Support querying a range of hourly data partitions - https://phabricator.wikimedia.org/T294654 (10nettrom_WMF) I needed something like this for T309036, so I wrote a version that does this on a per-day basis. This can then be modified to do hourly qu... [17:17:10] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10BTullis) @Cmjohnson yes please, let's use hardware RAID for this please. As @RobH suggested in the parent task, let's... > use the flex bays as a raid1 for the OS data, and the... 
[17:29:07] (03CR) 10Aqu: [V: 03+2 C: 03+2] "- Included in the weekly train" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/806200 (owner: 10Joal) [17:35:37] (03CR) 10Aqu: [V: 03+2 C: 03+2] "I merge because:" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/807175 (https://phabricator.wikimedia.org/T310576) (owner: 10Joal) [19:16:04] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS buster [19:16:54] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10Cmjohnson) [19:30:54] !log Deploying analytics/refinery [19:30:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:33:21] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform, 10Patch-For-Review: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (10Eevans) [19:42:14] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS buster exec... [19:42:35] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS bullseye [19:46:24] ottomata: OK, I think I got it. 
[19:46:34] to use the script as your user: [19:46:35] RECOVERY - AQS root url on aqs2003 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [19:46:35] RECOVERY - AQS root url on aqs2004 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [19:46:51] ./run_dev_instance.sh -p 8765 analytics [19:47:01] to run as analytics-privatedata: [19:47:23] (and thus have access to skein and cluster mode) [19:48:00] sudo -u analytics-privatedata kerberos-run-command analytics-privatedata ./run_dev_instance.sh -p 8765 -m /tmp/your_home analytics [19:48:20] ottomata: I tested this ^^^ and worked for me [19:48:23] RECOVERY - Check systemd state on aqs2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:51:21] RECOVERY - Check systemd state on aqs2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:52:27] RECOVERY - AQS root url on aqs2005 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [19:53:45] RECOVERY - AQS root url on aqs2006 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [19:54:29] RECOVERY - AQS root url on aqs2007 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [19:55:45] RECOVERY - AQS root url on aqs2009 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [19:56:57] RECOVERY - Check systemd state on aqs2010 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:58:03] RECOVERY - AQS root url on aqs2012 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [19:58:05] RECOVERY - AQS root url on aqs2011 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.088 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [19:58:11] RECOVERY - Check systemd state on aqs2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:01:45] (03CR) 10Jiyu: [C: 03+1] "Looks good!" [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/807534 (https://phabricator.wikimedia.org/T288976) (owner: 10Vivian Rook) [20:13:17] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS bullseye ex... [20:15:29] (03CR) 10Jiyu: [C: 03+1] "I don't understand why this bug happens but looks like it's gonna work" [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/806474 (https://phabricator.wikimedia.org/T306658) (owner: 10Vivian Rook) [20:18:11] 10Data-Engineering: Add xcollazo to analytics-admins - https://phabricator.wikimedia.org/T311176 (10XCollazo-WMF) [20:20:51] mforns: that looks pretty good! 
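The two invocations quoted above (running the script as yourself, or as analytics-privatedata through `sudo` and `kerberos-run-command`) differ only in a prefix and the `-m` option, so a thin wrapper could assemble either one from parameters. The sketch below only builds the argument list and does not execute anything. The helper name `build_command` is hypothetical, and reading `-m` as "the home directory to use" is an assumption based on the example path `/tmp/your_home`.

```python
import shlex
from typing import List, Optional

SCRIPT = "./run_dev_instance.sh"


def build_command(
    port: int,
    instance: str,
    run_as: Optional[str] = None,
    home: Optional[str] = None,
) -> List[str]:
    """Build the argv for starting an Airflow dev instance.

    run_as: system user to run as (e.g. analytics-privatedata); None means
            run as the calling user, who is expected to have run kinit already.
    home:   value for the script's -m option (assumed to be the home directory).
    """
    cmd = [SCRIPT, "-p", str(port)]
    if home is not None:
        cmd += ["-m", home]
    cmd.append(instance)
    if run_as is not None:
        # Prefix with sudo and kerberos-run-command so the principal is available.
        cmd = ["sudo", "-u", run_as, "kerberos-run-command", run_as] + cmd
    return cmd


if __name__ == "__main__":
    print(shlex.join(build_command(8765, "analytics")))
    print(shlex.join(build_command(8765, "analytics",
                                   run_as="analytics-privatedata",
                                   home="/tmp/your_home")))
```

Returning a list rather than a string keeps quoting out of the picture if the result is later handed to `subprocess.run`; `shlex.join` is used here only to display the command.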
[20:22:01] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service,refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:22:10] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS buster [20:26:11] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:26:51] ottomata: ok [20:27:45] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS buster exec... [20:28:06] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS bullseye [20:28:25] PROBLEM - Check unit status of refine_event on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:29:46] (03CR) 10Vivian Rook: [C: 03+2] Get non-coincidental history entries. 
[analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/806474 (https://phabricator.wikimedia.org/T306658) (owner: 10Vivian Rook) [20:31:25] PROBLEM - Check unit status of refine_eventlogging_legacy on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:35:24] (03Merged) 10jenkins-bot: Get non-coincidental history entries. [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/806474 (https://phabricator.wikimedia.org/T306658) (owner: 10Vivian Rook) [20:38:01] PROBLEM - Check unit status of refine_eventlogging_analytics on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:39:37] 10Quarry, 10Patch-For-Review: Quarry history feature not showing history - https://phabricator.wikimedia.org/T306658 (10rook) 05Open→03Resolved [20:45:11] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1007.eqiad.wmnet with OS bullseye [20:49:41] PROBLEM - Check unit status of refine_netflow on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_netflow https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:50:33] ottomata: is there a way to know whether a user is a personal user or a user like analytics-privatedata? just by having the username? 
[20:55:13] !log `scap deploy -f analytics/refinery` because of a crash during `git-fat pull` [20:55:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [21:00:15] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:22:17] (03PS2) 10Vivian Rook: Introduce black formatting to quarry [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/807534 (https://phabricator.wikimedia.org/T288976) [21:25:11] RECOVERY - Check unit status of refine_event on an-launcher1002 is OK: OK: Status of the systemd unit refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:26:18] (03CR) 10CI reject: [V: 04-1] Introduce black formatting to quarry [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/807534 (https://phabricator.wikimedia.org/T288976) (owner: 10Vivian Rook) [21:28:07] RECOVERY - Check unit status of refine_eventlogging_legacy on an-launcher1002 is OK: OK: Status of the systemd unit refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:28:48] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS bullseye ex... 
[21:34:47] RECOVERY - Check unit status of refine_eventlogging_analytics on an-launcher1002 is OK: OK: Status of the systemd unit refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:39:57] (03PS3) 10Vivian Rook: Introduce black formatting to quarry [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/807534 (https://phabricator.wikimedia.org/T288976) [21:45:55] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1007.eqiad.wmnet with OS bullseye ex... [21:46:29] RECOVERY - Check unit status of refine_netflow on an-launcher1002 is OK: OK: Status of the systemd unit refine_netflow https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:46:53] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:10:10] (03PS4) 10Vivian Rook: Introduce black formatting to quarry [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/807534 (https://phabricator.wikimedia.org/T288976) [22:13:04] (03CR) 10CI reject: [V: 04-1] Introduce black formatting to quarry [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/807534 (https://phabricator.wikimedia.org/T288976) (owner: 10Vivian Rook) [22:15:00] (03PS5) 10Vivian Rook: Introduce black formatting to quarry [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/807534 (https://phabricator.wikimedia.org/T288976)