[00:19:15] (SystemdUnitFailed) firing: (13) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:34:14] (SystemdUnitFailed) firing: (13) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:16:04] (PuppetFailure) firing: (6) Puppet has failed on stat1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:00:25] Analytics-Kanban, Wikimedia-Medicine: Make top pages for WP:MED articles - https://phabricator.wikimedia.org/T139324 (Harej)
[04:19:15] (SystemdUnitFailed) firing: (11) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:21:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:24:15] (SystemdUnitFailed) firing: (11) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:29:15] (SystemdUnitFailed) firing: (11) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:29:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:20:49] (PuppetFailure) firing: (6) Puppet has failed on stat1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:33:32] (CR) KCVelaga (wikimf): [C: +2] content_translation_event: Add more event_source values [schemas/event/secondary] - https://gerrit.wikimedia.org/r/984864 (https://phabricator.wikimedia.org/T353615) (owner: Bearloga)
[05:34:05] (Merged) jenkins-bot: content_translation_event: Add more event_source values [schemas/event/secondary] - https://gerrit.wikimedia.org/r/984864 (https://phabricator.wikimedia.org/T353615) (owner: Bearloga)
[05:58:57] (PS1) KCVelaga: cx event: add event sources for return to the dashboard [schemas/event/secondary] - https://gerrit.wikimedia.org/r/991111 (https://phabricator.wikimedia.org/T355200)
[06:13:07] (PS41) Cyndywikime: Add analytics for Impressions, Success and Abandonment of account creation [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273)
[06:14:34] (PS42) Cyndywikime: Add analytics for impressions, success and abandonment of account creation [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273)
[06:16:40] (CR) Cyndywikime: "Thanks for the review. Will go ahead and use mediawiki/state/entity/user instead." [schemas/event/primary] - https://gerrit.wikimedia.org/r/989487 (https://phabricator.wikimedia.org/T300273) (owner: Cyndywikime)
[06:17:15] (Abandoned) Cyndywikime: Add user_is_temp property [schemas/event/primary] - https://gerrit.wikimedia.org/r/989487 (https://phabricator.wikimedia.org/T300273) (owner: Cyndywikime)
[06:18:08] (CR) Cyndywikime: "Done" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: Cyndywikime)
[08:26:34] Data-Engineering, Gerrit, Release-Engineering-Team, Upstream: git clone and git pull commands fail for refinery repo - https://phabricator.wikimedia.org/T355173 (hashar) I have uploaded at https://people.wikimedia.org/~hashar/T355173/ : | [[ https://people.wikimedia.org/~hashar/T355173/local-clo...
[08:30:28] (SystemdUnitFailed) firing: (10) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:12:31] Data-Platform-SRE, Patch-For-Review: Ensure Elastic stack works on bookworm - https://phabricator.wikimedia.org/T353392 (Gehel)
[09:21:04] (PuppetFailure) firing: (6) Puppet has failed on stat1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:24:38] puppet is currently failing on all stat boxes with error: Could not delete user daniram: Execution of '/usr/sbin/userdel daniram' returned 8:, userdel: user daniram is currently used by process . Is that something we know about?
[09:25:26] we just have to `kill ` - whatever it is. It means that a user has been deleted, but still has a process open.
[09:27:41] btullis: I might be 3' late to our meeting.
[09:27:55] Perfect, thanks :-)
[09:30:50] btullis: Actually I'm there!
[09:31:06] I have a problem with my furnace and there's a technician trying to fix it
[09:34:37] Hi brouberol! Thanks for the review: https://gerrit.wikimedia.org/r/c/operations/puppet/+/990688 . I have no right to merge puppet, so can you do it ?
[09:35:22] done!
[09:36:41] Thanks
[09:50:49] (PuppetFailure) firing: (6) Puppet has failed on stat1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:55:49] (PuppetFailure) firing: (6) Puppet has failed on stat1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:57:29] I killed all processes owned by daniram on stat boxes
[10:00:49] (PuppetFailure) firing: (6) Puppet has failed on stat1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:05:49] (PuppetFailure) resolved: (6) Puppet has failed on stat1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:51:25] Data-Engineering, Gerrit, Release-Engineering-Team, Upstream: git clone and git pull commands fail for refinery repo - https://phabricator.wikimedia.org/T355173 (gmodena) @thcipriani just wanted to give an ack that I managed to reproduce `jgit clone` case. However, `jgit fetch` failed on the `dep...
[11:39:30] Data-Platform-SRE (2023/24 Q3 Milestone 1), Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[11:44:04] Data-Platform-SRE (2023/24 Q3 Milestone 1), Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[11:44:38] Data-Engineering, Data Products, MediaWiki-extensions-EventLogging, CSS: Schema code samples popup appears under the JSON table - https://phabricator.wikimedia.org/T272857 (phuedx)
[12:30:29] (SystemdUnitFailed) firing: (10) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:47:31] Data-Engineering (Sprint 7): [Refine System] Define a concept and an approach for refactoring the Refine system - https://phabricator.wikimedia.org/T354696 (lbowmaker)
[12:48:18] Data-Engineering (Sprint 7): [Iceberg Migration] Define sensor concept and implementation plan - https://phabricator.wikimedia.org/T354695 (lbowmaker)
[12:49:16] Data-Engineering (Sprint 7): [Data Quality] Implement basic data quality metrics for MW history - https://phabricator.wikimedia.org/T354692 (lbowmaker)
[12:50:07] Data-Engineering (Sprint 7), Spike: [Data Quality] [SPIKE] Can we migrate the anomaly detection job to DeeQu checks - https://phabricator.wikimedia.org/T354566 (lbowmaker)
[12:51:29] Data-Engineering (Sprint 7): [Dataset Config Store] [SPIKE] Investigate existing backend solutions - https://phabricator.wikimedia.org/T354558 (lbowmaker)
[12:53:26] Data-Engineering (Sprint 7): [Maintenance] Migrate ReportUpdater browser queries to Airflow - https://phabricator.wikimedia.org/T354552 (lbowmaker)
[12:54:08] Data-Engineering (Sprint 7): [Iceberg Migration] Migrate session length tables to Iceberg - https://phabricator.wikimedia.org/T352672 (lbowmaker)
[12:55:23] Data-Engineering (Sprint 7): [Iceberg Migration] Migrate pageview tables to Iceberg - https://phabricator.wikimedia.org/T347690 (lbowmaker)
[12:56:05] Data-Engineering (Sprint 7), Event-Platform: [Event Platform] eventutilites-python: improve consistency guarantees of async process functions - https://phabricator.wikimedia.org/T347282 (lbowmaker)
[13:39:15] (SystemdUnitFailed) firing: (10) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:39:48] Data-Platform-SRE (2023/24 Q3 Milestone 1), serviceops, Discovery-Search (Current work): Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503 (pfischer) As of today, all non-private wikis featuring the cirrussearch extension publish page_reren...
[13:48:07] Data-Platform-SRE, Data-Services, cloud-services-team, Patch-For-Review: Automate maintain-views replica depooling - https://phabricator.wikimedia.org/T300427 (taavi)
[14:25:37] Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, serviceops: Requesting permission to enable kafka log compaction for page_rerender on kafka-main - https://phabricator.wikimedia.org/T354794 (bking)
[14:56:22] Data-Engineering, Gerrit, Release-Engineering-Team, Upstream: git clone and git pull commands fail for refinery repo - https://phabricator.wikimedia.org/T355173 (hashar) I don't know what is going on since when I serve the bare repository with cgit (`git daemon --export-all`) and then fetch from...
[15:18:58] (CR) Joal: "1 nit in commit message then good to go for me (needs testing :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/988711 (https://phabricator.wikimedia.org/T352670) (owner: Snwachukwu)
[15:19:15] (SystemdUnitFailed) firing: (11) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:20:52] PROBLEM - Check systemd state on clouddb1015 is CRITICAL: CRITICAL - degraded: The following units failed: wmf-pt-kill@s4.service,wmf-pt-kill@s6.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:21:35] Data-Engineering, Gerrit, Release-Engineering-Team, Upstream: git clone and git pull commands fail for refinery repo - https://phabricator.wikimedia.org/T355173 (hashar)
[15:23:52] RECOVERY - Check systemd state on clouddb1015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:24:15] (SystemdUnitFailed) firing: (11) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:24:31] (PS4) Snwachukwu: Migration of browser General table to iceberg format. [analytics/refinery] - https://gerrit.wikimedia.org/r/988711 (https://phabricator.wikimedia.org/T352670)
[15:26:13] Data-Platform-SRE (2023/24 Q3 Milestone 1), Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[15:28:00] (CR) Snwachukwu: Migration of browser General table to iceberg format. (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/988711 (https://phabricator.wikimedia.org/T352670) (owner: Snwachukwu)
[15:34:12] Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search (Current work): SUP: Production TODOs - https://phabricator.wikimedia.org/T354595 (bking)
[15:50:54] Data-Platform-SRE: Review the use of scap + git-fat for Data Platform Engineering use cases - https://phabricator.wikimedia.org/T354936 (bking) We already have a ticket for replacing git-fat with git-lfs ( T316876 ). If we decide to update git-fat for Python 3, then we can close that ticket.
[15:52:49] Data-Engineering (Sprint 7): [Maintenance] Safeguard VarnishKafka to HAProxy analytics transition - https://phabricator.wikimedia.org/T354694 (gmodena) Possibly duplicates https://phabricator.wikimedia.org/T351117
[15:55:39] Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search (Current work): SUP: Production TODOs - https://phabricator.wikimedia.org/T354595 (bking)
[15:59:11] Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search (Current work): SUP: Production TODOs - https://phabricator.wikimedia.org/T354595 (bking)
[16:13:50] Data-Engineering, Data-Platform-SRE, Data-Platform, Release-Engineering-Team, Discovery-Search (Current work): SonarQube build are failing with Java 11 - https://phabricator.wikimedia.org/T355122 (CodeReviewBot) pfischer merged https://gitlab.wikimedia.org/repos/search-platform/cirrus-streami...
[16:40:51] Data-Engineering, Gerrit, Release-Engineering-Team, Upstream: git clone and git pull commands fail for refinery repo - https://phabricator.wikimedia.org/T355173 (gmodena) >>! In T355173#9464084, @hashar wrote: [...] > The old git protocol has the same issue (`-c protocol.version=0`). I have no id...
[16:47:27] Data-Engineering, Movement-Insights, Traffic, Patch-For-Review: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (fkaelin) Thanks for the updates @dr0ptp4kt, and nice that you are able to reproduce such a google proxy request. One thing that I am...
[18:13:05] Data-Engineering, Movement-Insights, Traffic, Patch-For-Review: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (dr0ptp4kt) Thanks @fkaelin . Yes, those prefetches happened without clicking on them. It seems to occur both for searches originating...
[18:17:16] Data-Engineering, Movement-Insights, Traffic, Patch-For-Review: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (dr0ptp4kt) Just to put something concrete (not saying this is **the** thing), here's an interesting unit test on the prefetch predicto...
[18:43:17] Data-Platform-SRE (2023/24 Q3 Milestone 1), observability: Change data platform-related IRC channels to improve communication - https://phabricator.wikimedia.org/T352783 (bking)
[18:50:56] Data-Platform-SRE (2023/24 Q3 Milestone 1), observability: Change data platform-related IRC channels to improve communication - https://phabricator.wikimedia.org/T352783 (bking) Changing AC after discussions with #data-platform-sre and @Gehel . We're not going to remove the current IRC channels. We're j...
[18:51:29] Data-Platform-SRE (2023/24 Q3 Milestone 1), observability: Change data platform-related IRC channels to improve communication - https://phabricator.wikimedia.org/T352783 (bking)
[19:25:29] (SystemdUnitFailed) firing: (9) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:35:59] Data-Engineering, Movement-Insights, Traffic, Patch-For-Review: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (dr0ptp4kt)
[19:38:39] Data-Engineering, Movement-Insights, Traffic, Patch-For-Review: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (dr0ptp4kt) @fkaelin `Sec-Purpose: prefetch;prerender` is mentioned for the omnibox use case at https://developer.chrome.com/docs/web-p...
[20:25:28] Data-Engineering, Gerrit, Release-Engineering-Team, Upstream: git clone and git pull commands fail for refinery repo - https://phabricator.wikimedia.org/T355173 (gmodena) @hashar @thcipriani the version of `git` installed on deploy2002 (2.20.1) does not support `--refetch`: ` gmodena@deploy2002...
[20:41:49] Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search (Current work): SUP: Production TODOs - https://phabricator.wikimedia.org/T354595 (bking) a:bking
[21:20:56] Data-Engineering, Gerrit, Release-Engineering-Team, Upstream: git clone and git pull commands fail for refinery repo - https://phabricator.wikimedia.org/T355173 (hashar) @gmodena I forgot the deployment server have an old version of git :-\ The issue is on the server side anyway and the bare rep...
[21:26:56] Data-Engineering, Release-Engineering-Team, Gerrit (Gerrit 3.7), Upstream: git clone and git pull commands fail for refinery repo - https://phabricator.wikimedia.org/T355173 (hashar) Open→Stalled I have worked around the issue by doing a fresh clone on the deployment server, though that f...
[22:11:12] Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search (Current work): SUP: Production TODOs - https://phabricator.wikimedia.org/T354595 (bking)
[23:19:24] Data-Engineering, Release-Engineering-Team, Gerrit (Gerrit 3.7), Upstream: git clone and git pull commands fail for refinery repo - https://phabricator.wikimedia.org/T355173 (thcipriani) >>! In T355173#9467192, @hashar wrote: > So that is worked around for now and I am marking this task stalled p...
[23:25:29] (SystemdUnitFailed) firing: (9) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed