[06:44:51] (CR) Phuedx: Add Metrics Platform fragments by platform only (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[09:30:30] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (Gehel) Current status: `gehel@cumin1001:~$ sudo cumin 'A:wdqs-all OR A:wcqs-public' 'cat /etc/debian_version' 35 hosts will be targeted: wcqs[2001-2003].codfw.wmnet,wcqs[1001-1003].eqiad.wmnet,wd...
[09:50:35] Data-Engineering: Check home/HDFS leftovers of tmtl.io contractors - https://phabricator.wikimedia.org/T340942 (BTullis) Open→Resolved a:BTullis Thanks @Htriedman for the confirmation. Removed the remaining home directories with: ` sudo cumin 'C:profile::analytics::cluster::client or C:profile::...
[10:03:16] Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (jbond) @Stevemunene @BTullis great work on the progress you have made and sorry for my silence. Just want to say +1 to using option 1. the id value you have the...
[10:17:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:19:48] Data-Platform-SRE: Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (BTullis) Moving to #data-platform-sre for re-triage.
[10:20:01] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:21:03] Data-Platform-SRE: Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (BTullis) a:odimitrijevic→None
[10:30:44] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:32:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:33:26] Data-Platform-SRE, API Platform, Anti-Harassment, Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (Physikerwelt)
[10:35:21] Data-Platform-SRE, API Platform, Anti-Harassment, Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (Physikerwelt)
[10:37:55] Data-Platform-SRE, Infrastructure-Foundations: krb1001's auth.log grows a lot causing disk space issues for the root partition - https://phabricator.wikimedia.org/T302518 (BTullis) Re-tagging to allow-re-triage.
[11:11:50] Data-Platform-SRE, Infrastructure-Foundations: krb1001's auth.log grows a lot causing disk space issues for the root partition - https://phabricator.wikimedia.org/T302518 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff I fixed this a few months ago via https://phabricator.wikimedia.org/...
[11:12:29] Data-Platform-SRE, Discovery-Search, SRE: Unable to use kafka-topic.sh - Topic authorization failed - https://phabricator.wikimedia.org/T344989 (pfischer)
[12:31:07] Data-Engineering, All-and-every-Wikisource, ArticlePlaceholder, BetaFeatures, and 55 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (phuedx)
[13:54:42] (CR) Mforns: "LGTM!! Left 1 inline question regarding $refs." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: Phuedx)
[15:23:15] (PS3) Phuedx: Add analytics/metrics_platform/{app,web}_click schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833)
[15:23:38] (CR) Phuedx: Add analytics/metrics_platform/{app,web}_click schemas (4 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: Phuedx)
[15:30:05] PROBLEM - Presto Server on an-presto1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[15:50:11] RECOVERY - Presto Server on an-presto1002 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[16:28:34] Data-Platform-SRE: Ensure Data Platform SREs have a contact group in puppet/alerting - https://phabricator.wikimedia.org/T342578 (BTullis)
[16:43:06] Data-Platform-SRE, Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (BTullis) I have created a patch to build docker images of Spark version 3.3.3, from which we can extract the spark-yarn-shuffler jar....
[16:58:38] Data-Platform-SRE: Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0 - https://phabricator.wikimedia.org/T338057 (BTullis) p:High→Medium
[17:03:58] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1009.eqiad.wmnet with OS bullseye
[17:08:36] Data-Engineering, Data Pipelines (Sprint 14), Data Products (Sprint 00), Google-Chrome-User-Agent-Deprecation, Product-Analytics (Kanban): Model impact of User-Agent deprecation on top line metrics - https://phabricator.wikimedia.org/T336084 (VirginiaPoundstone)
[17:09:05] Data-Engineering, Data Pipelines (Sprint 14), Data Products (Sprint 00), Google-Chrome-User-Agent-Deprecation, Product-Analytics (Kanban): [SPIKE] Model impact of User-Agent deprecation on top line metrics - https://phabricator.wikimedia.org/T336084 (VirginiaPoundstone)
[17:13:28] Analytics, AQS2.0, Data Products, Tech-Docs-Team, and 6 others: AQS 2.0 documentation - https://phabricator.wikimedia.org/T288664 (VirginiaPoundstone)
[17:36:55] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1009.eqiad.wmnet with OS bullseye executed with errors: - wdqs1009 (**FAIL**) - Downtimed on Icin...
[19:17:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:19:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:31:21] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:32:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:38:37] Data-Engineering, Data-Platform-SRE, Scap: analytics/refinery: Stop using git-fat - https://phabricator.wikimedia.org/T328472 (VirginiaPoundstone)
[19:52:17] (PS6) Clare Ming: Add Metrics Platform fragments by platform only [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557)
[19:53:42] (CR) Clare Ming: Add Metrics Platform fragments by platform only (5 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[20:35:03] (CR) Clare Ming: "just a general Q about organization -- should we have separate web + app folders in the MP analytics directory?" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: Phuedx)
[21:05:41] Data-Platform-SRE: Decommission wdqs10[03-05] - https://phabricator.wikimedia.org/T344198 (bking) wdqs1005 is alerting for ipmi, and based on [[ https://www.dell.com/community/en/conversations/poweredge-hardware-general/can't-update-idrac-firmware/647f768bf4ccf8a8de49ded5 | this search result ]] it seems lik...
[21:47:31] amigo milimetric
[21:47:55] do you remember what was the tool ellery used for this visualization: https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream#/media/File:London_clickstream.png cc joal
[22:21:30] nvm it must have been pyplot: https://plotly.com/python/sankey-diagram/
[22:23:28] that, i thought, looked too good to be pyplot
[23:35:28] Have y'all seen ? It seems to be Jupyter Notebooks with the ability to live collab on a notebook and save them in google drive.