[02:03:51] RECOVERY - Check systemd state on analytics1060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:44] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:16:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[03:01:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[03:06:44] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:11:44] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:36:44] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:41:44] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:14:24] Data-Engineering, Data-Persistence, Data-Services, cloud-services-team, Patch-For-Review: Wiki-replicas: investigate why some maintenance operations can cause unwanted pybal impact - https://phabricator.wikimedia.org/T337721 (Marostegui) @BBlack the fix at T337446#8888642 can now be reverted...
[05:25:03] Data-Engineering, AQS2.0, API Platform (AQS 2.0 Roadmap), Epic, and 2 others: AQS 2.0: Device Analytics service - https://phabricator.wikimedia.org/T288298 (SGupta-WMF)
[06:04:54] Data-Engineering, DBA: Clean up clouddb1021 - https://phabricator.wikimedia.org/T337961 (Marostegui) p:Triage→Medium a:Ladsgroup
[07:10:12] !log move varnishkafka instances on cp4037 to PKI TLS certs
[07:10:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:10:17] Cc: btullis, joal --^
[07:10:35] I tried to check data from kafkacat on stat1004, all good afaics
[07:11:56] Data-Engineering, Data-Platform-SRE, SRE, Traffic: Move varnishkafka to PKI - https://phabricator.wikimedia.org/T337825 (elukey) All vk instances running on cp4037, next steps: 1) Monitor cp4037 to verify that nothing explodes. 2) Extend the change to ulsfo and monitor. 3) Extend the change to a...
[07:13:10] set "loadByPeriod(P30D+future), dropForever" for webrequest_sampled_live in druid-analytics - T337460
[07:13:11] T337460: Increase webrequest_sampled_live Druid datasource's retention - https://phabricator.wikimedia.org/T337460
[07:39:49] Data-Engineering, Data-Platform-SRE: Bring stat1009 into service - https://phabricator.wikimedia.org/T336036 (Stevemunene)
[07:41:44] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:50:14] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: Improve mediawiki-event-enrichment test suite - https://phabricator.wikimedia.org/T328013 (CodeReviewBot) tchin merged https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/merge_request...
[08:25:28] elukey: Great work! On the lookout for any explosions now :-)
[08:26:18] How long do you think we should give it until we roll out to ulsfo ?
[08:26:58] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Rebuild all hadoop packages for bullseye with different distribution suffix mechanism - https://phabricator.wikimedia.org/T337465 (CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/bigtop/-/merge_requests/...
[08:29:22] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Rebuild all hadoop packages for bullseye with different distribution suffix mechanism - https://phabricator.wikimedia.org/T337465 (CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/bigtop/-/merge_requests/...
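For reference, the retention change logged at 07:13:10 ("loadByPeriod(P30D+future), dropForever") corresponds to an ordered list of two Druid retention rules. The following is a minimal sketch of what that rule list might look like as JSON, built in Python. The helper name `retention_rules` and the replica count under `_default_tier` are assumptions for illustration only; the log does not say how many replicas or which tiers are configured on druid-analytics.

```python
import json


def retention_rules(period="P30D", include_future=True, replicants=None):
    """Build an ordered list of Druid retention rules (hypothetical helper).

    Druid evaluates rules top to bottom: segments inside the period are
    loaded, and everything else falls through to dropForever.
    """
    # ASSUMPTION: the tier name and replica count are illustrative defaults,
    # not values taken from the log.
    if replicants is None:
        replicants = {"_default_tier": 2}
    return [
        {
            "type": "loadByPeriod",
            "period": period,                 # ISO 8601 period, here 30 days
            "includeFuture": include_future,  # the "+future" part of the log entry
            "tieredReplicants": replicants,
        },
        {"type": "dropForever"},
    ]


if __name__ == "__main__":
    print(json.dumps(retention_rules(), indent=2))
```

The order matters: if `dropForever` came first, no segment would ever reach the load rule.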
[08:29:46] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Rebuild all hadoop packages for bullseye with different distribution suffix mechanism - https://phabricator.wikimedia.org/T337465 (CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/bigtop/-/merge_requests/...
[08:33:46] ack elukey - Thank you :)
[09:07:49] Data-Engineering, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (ayounsi) `name=IPv4,lang=json { "event_type": "purge", "tag2": 1, "as_src": 48551, "as_dst": 0, "comms": "", "as_path": ""...
[09:09:35] btullis: o/ no idea, I thought about a couple of days maybe? Just to be sure
[09:27:21] Data-Engineering, Data-Platform-SRE: Bring stat1009 into service - https://phabricator.wikimedia.org/T336036 (BTullis) >>! In T336036#8901631, @BTullis wrote: > I see that we are affected by the same issue as {T336281} in that two consecutive runs of puppet add and then remove hive > > The last time we...
[09:28:45] Data-Engineering, Data-Platform-SRE: Bring stat1009 into service - https://phabricator.wikimedia.org/T336036 (BTullis) You can also add stat1009 to the refinery deployment targets here: https://gerrit.wikimedia.org/r/admin/repos/analytics/refinery/scap,general
[09:51:15] (CR) DCausse: [C: +1] Encode redirect targets in page change events. 
[schemas/event/primary] - https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: Peter Fischer)
[09:51:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[09:54:58] Data-Engineering, Data-Engineering-Jupyter: Read and process parquet files from Jupyter notebooks - https://phabricator.wikimedia.org/T338932 (gmodena)
[10:49:06] RECOVERY - Check systemd state on analytics1059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:53:10] PROBLEM - Check systemd state on analytics1059 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:56:32] I'm planning to push out version 0.0.18 of conda-analytics to the prod cluster today, unless anyone objects. Replacing 0.0.13 and skipping all of the releases in between.
[10:58:03] The only changes are: 1) Adding the spark3-yarn shuffler jar, but not enabling it. 2) Adding the iceberg 1.2.1 jar. 
3) Upgrading ipykernel from 6.18.0 to 6.19.4
[11:41:44] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:55:56] Data-Engineering, Data-Engineering-Jupyter: Read and process parquet files from Jupyter notebooks - https://phabricator.wikimedia.org/T338932 (Ottomata) For accessing HDFS via pyarrow: `lang=python $ CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob` ipython from pyarrow import fs hdfs = fs.HadoopFi...
[12:03:53] (PS1) Stevemunene: Add stat1009 to scap targets [analytics/refinery/scap] - https://gerrit.wikimedia.org/r/928918 (https://phabricator.wikimedia.org/T336036)
[12:15:38] Data-Engineering, Data-Engineering-Jupyter: Read and process parquet files from Jupyter notebooks - https://phabricator.wikimedia.org/T338932 (gmodena) ` $ CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob` ipython ` This is what I want to avoid, and use the hosted jupyter hub session instead. `workfl...
[12:19:54] RECOVERY - Check systemd state on analytics1059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:21:44] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:23:48] (CR) Btullis: [C: +1] "Looks good to me. Worth adding to the ops week deployment etherpad as a note for the next person who deploys refinery." 
[analytics/refinery/scap] - https://gerrit.wikimedia.org/r/928918 (https://phabricator.wikimedia.org/T336036) (owner: Stevemunene)
[12:25:19] !log beginning rollout of conda-analytics 0.0.18 to hadoop-workers
[12:25:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:26:44] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:27:46] PROBLEM - Check systemd state on analytics1059 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:29:28] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Upgrade the spark YARN shuffler service on Hadoop workers from version 2 to 3 - https://phabricator.wikimedia.org/T332765 (BTullis) I'm pushing out version 0.0.18 of conda-analytics, so the spark3 yarn shuffler jar will be present on all...
[12:31:57] Data-Engineering, Data-Platform-SRE, Observability-Alerting: Reduce alert noise associated with individual users' jupyterhub-singleuser services - https://phabricator.wikimedia.org/T336951 (BTullis) Open→Resolved
[12:45:53] Data-Engineering, Data-Engineering-Jupyter: Read and process parquet files from Jupyter notebooks - https://phabricator.wikimedia.org/T338932 (Ottomata) Aye ya, if you want to do it with pyarrow directly, you can use [[ https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils/-/blob/main/workfl...
[12:57:29] Analytics, Data-Engineering-Planning, Event-Platform Value Stream, Platform Team Workboards (Clinic Duty Team): Adopt conventions for server receive and client/event timestamps in non analytics event schemas - https://phabricator.wikimedia.org/T267648 (Ottomata)
[13:00:12] !log rolled out conda-analytics 0.0.18 to analytics-airflow and hadoop-coordinator
[13:00:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:07:37] Data-Engineering: Increase webrequest_sampled_live Druid datasource's retention - https://phabricator.wikimedia.org/T337460 (elukey) Next steps: * Wait a couple of weeks to get `webrequest_sampled_live` retention to 30 days, check druid metrics etc.. * Discuss with DE and SRE about deprecating `webrequest_s...
[13:09:14] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: eventutilities-python: review and clean up in preparation for a GA release. - https://phabricator.wikimedia.org/T336488 (JArguello-WMF)
[13:36:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[13:37:46] !log restarting hive-server2 and hive-metastore on an-coord1002 prior to failover.
[13:37:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:45:55] !log fixed broken graphs in the varnishkafka's dashboard
[13:45:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:52:13] Data-Engineering, Data-Platform-SRE: Trash cleanup cron spams on an-test hosts - https://phabricator.wikimedia.org/T286442 (BTullis) Open→Resolved a:BTullis I think that this is resolved, isn't it? 
I'm not seeing any regular cronspam from an-test hosts. Please feel free to reopen if I have mi...
[14:04:12] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), MW-1.41-notes (1.41.0-wmf.13; 2023-06-13), Patch-For-Review: Release mediawiki.page_change.v1 stream - https://phabricator.wikimedia.org/T336817 (Ottomata) I'm going to wait for https://gerrit.wikimedia.org/r/929430 to roll out with wmf...
[14:04:32] (PS5) Ottomata: Remove is_registered field from user entity fragment [schemas/event/primary] - https://gerrit.wikimedia.org/r/923253 (https://phabricator.wikimedia.org/T337395) (owner: TChin)
[14:21:25] (CR) Ottomata: [C: +2] Remove is_registered field from user entity fragment [schemas/event/primary] - https://gerrit.wikimedia.org/r/923253 (https://phabricator.wikimedia.org/T337395) (owner: TChin)
[14:22:02] (Merged) jenkins-bot: Remove is_registered field from user entity fragment [schemas/event/primary] - https://gerrit.wikimedia.org/r/923253 (https://phabricator.wikimedia.org/T337395) (owner: TChin)
[14:25:59] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): mw-page-content-change-enrich should enable HA with k8s ConfigMaps - https://phabricator.wikimedia.org/T338233 (gmodena) > Let's verify this with Search and SRE ServiceOps @JMeybohm @dcausse we'd like to pick this up in sprint 14B. Would you h...
[14:28:17] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): mw-page-content-change-enrich should enable HA with k8s ConfigMaps - https://phabricator.wikimedia.org/T338233 (dcausse) @gmodena no concerns from my side, main question I'd have when do we consider the flink-app deploymen
[14:42:37] (PS1) Aqu: Use canonical_data countries maintained by analytics-product [analytics/refinery] - https://gerrit.wikimedia.org/r/929723 (https://phabricator.wikimedia.org/T338033)
[14:44:28] (PS2) Aqu: Use canonical_data countries maintained by analytics-product [analytics/refinery] - https://gerrit.wikimedia.org/r/929723 (https://phabricator.wikimedia.org/T338033)
[14:50:42] (PS1) Snwachukwu: Update changelog for v0.2.16 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/929724
[14:51:36] (CR) Snwachukwu: [C: +2] Update changelog for v0.2.16 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/929724 (owner: Snwachukwu)
[14:53:38] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), MW-1.41-notes (1.41.0-wmf.12; 2023-06-06), Patch-For-Review: Remove user is_registered field from mediawiki/page/change schema - https://phabricator.wikimedia.org/T337395 (Ottomata)
[14:54:45] (CR) Stevemunene: Add stat1009 to scap targets (1 comment) [analytics/refinery/scap] - https://gerrit.wikimedia.org/r/928918 (https://phabricator.wikimedia.org/T336036) (owner: Stevemunene)
[14:55:30] (CR) Stevemunene: [C: +2] Add stat1009 to scap targets [analytics/refinery/scap] - https://gerrit.wikimedia.org/r/928918 (https://phabricator.wikimedia.org/T336036) (owner: Stevemunene)
[14:55:36] (CR) Stevemunene: [V: +2 C: +2] Add stat1009 to scap targets [analytics/refinery/scap] - https://gerrit.wikimedia.org/r/928918 (https://phabricator.wikimedia.org/T336036) (owner: Stevemunene)
[15:01:08] (Merged) jenkins-bot: Update changelog for v0.2.16 [analytics/refinery/source] - 
https://gerrit.wikimedia.org/r/929724 (owner: Snwachukwu)
[15:03:02] btullis: we're in https://meet.google.com/rnb-jtio-dcy for the weekly data platforms SRE meeting
[15:03:11] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), MW-1.41-notes (1.41.0-wmf.12; 2023-06-06), Patch-For-Review: Remove user is_registered field from mediawiki/page/change schema - https://phabricator.wikimedia.org/T337395 (Ottomata)
[15:03:25] !log failing over the analytics-hive cname to an-coord1002
[15:03:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:05:41] !log dropping hive table event.mediawiki_page_change_v1 to pick up backwards incompatible schema change - T337395
[15:05:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:05:44] T337395: Remove user is_registered field from mediawiki/page/change schema - https://phabricator.wikimedia.org/T337395
[15:07:55] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), MW-1.41-notes (1.41.0-wmf.12; 2023-06-06), Patch-For-Review: Remove user is_registered field from mediawiki/page/change schema - https://phabricator.wikimedia.org/T337395 (Ottomata) ` [@an-launcher1002:/home/otto] $ sudo -u analytics ke...
[15:08:28] Data-Engineering, Data-Platform-SRE, Patch-For-Review: Bring stat1009 into service - https://phabricator.wikimedia.org/T336036 (Stevemunene)
[15:13:46] !log deploying refinery source
[15:13:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:19:25] !log drop event.mediawiki_page_outlink_topic_prediction_change table and data - T337395
[15:19:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:19:27] T337395: Remove user is_registered field from mediawiki/page/change schema - https://phabricator.wikimedia.org/T337395
[15:19:47] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), MW-1.41-notes (1.41.0-wmf.12; 2023-06-06): Remove user is_registered field from mediawiki/page/change schema - https://phabricator.wikimedia.org/T337395 (Ottomata) `lang=sql drop table event.mediawiki_page_outlink_topic_prediction_change; `...
[15:20:48] Starting build #121 for job analytics-refinery-maven-release-docker
[15:20:54] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), MW-1.41-notes (1.41.0-wmf.12; 2023-06-06): Remove user is_registered field from mediawiki/page/change schema - https://phabricator.wikimedia.org/T337395 (Ottomata)
[15:21:43] (PS1) DCausse: mediawiki/revision/score: add the dt field [schemas/event/primary] - https://gerrit.wikimedia.org/r/929733 (https://phabricator.wikimedia.org/T267648)
[15:26:51] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Upgrade the spark YARN shuffler service on Hadoop workers from version 2 to 3 - https://phabricator.wikimedia.org/T332765 (Gehel) Before we can activate the spark 3 shuffler, we need to migrate all jobs to spark 3.
[15:29:19] Data-Engineering, Data-Engineering-Wikistats, Data Pipelines (Sprint 14), Patch-For-Review: Wikistats Bug: Small countries not displayed on the map - https://phabricator.wikimedia.org/T338033 (Antoine_Quhen) I'm proposing with those patches: * https://github.com/wikimedia-research/canonical-data/...
[15:34:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:34:35] Project analytics-refinery-maven-release-docker build #121: SUCCESS in 13 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/121/
[15:36:44] (SystemdUnitFailed) firing: (4) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:37:29] Starting build #80 for job analytics-refinery-update-jars-docker
[15:37:49] (PS1) Maven-release-user: Add refinery-source jars for v0.2.16 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/928923
[15:37:50] Project analytics-refinery-update-jars-docker build #80: SUCCESS in 20 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/80/
[15:42:11] (CR) Snwachukwu: [C: +2] Add refinery-source jars for v0.2.16 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/928923 (owner: Maven-release-user)
[15:42:53] (CR) Snwachukwu: [V: +2 C: +2] Add refinery-source jars for v0.2.16 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/928923 (owner: Maven-release-user)
[15:45:33] !log Deployed refinery-source using jenkins
[15:45:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:46:18] RECOVERY - Check systemd state on 
an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:51:44] (SystemdUnitFailed) firing: (4) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:09:39] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), MW-1.41-notes (1.41.0-wmf.12; 2023-06-06): Remove user is_registered field from mediawiki/page/change schema - https://phabricator.wikimedia.org/T337395 (Ottomata) Recreated Hive tables look good: no is_registered field! I think we are done.
[16:20:58] (CR) Ottomata: mediawiki/revision/score: add the dt field (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/929733 (https://phabricator.wikimedia.org/T267648) (owner: DCausse)
[16:29:48] Data-Engineering, Data-Engineering-Wikistats, Data Pipelines (Sprint 14), Patch-For-Review: Wikistats Bug: Small countries not displayed on the map - https://phabricator.wikimedia.org/T338033 (mpopov) Thank you @Antoine_Quhen! I really really like this idea. I'm wondering if this would be a good...
[16:41:11] !log deploying refinery for weekly train
[16:41:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:52:13] (DiskSpace) firing: Disk space an-launcher1002:9100:/srv 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-launcher1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[16:56:43] Analytics-Radar, Data-Engineering-Icebox, Data-Persistence (work done), Platform Engineering Roadmap Decision Making, and 7 others: Remove revision_comment_temp and revision_actor_temp - https://phabricator.wikimedia.org/T215466 (Ladsgroup)
[16:57:00] Analytics-Radar, Data-Engineering-Icebox, Data-Persistence (work done), Platform Engineering Roadmap Decision Making, and 7 others: Remove revision_comment_temp and revision_actor_temp - https://phabricator.wikimedia.org/T215466 (Ladsgroup)
[17:02:34] PROBLEM - Disk space on an-launcher1002 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-launcher1002&var-datasource=eqiad+prometheus/ops
[17:11:47] kinit
[17:11:50] woops
[17:16:40] Password:
[17:17:02] hunter2
[17:18:34] :D
[17:26:05] Analytics, Data-Engineering-Icebox, Multi-Content-Revisions (Tech Debt): Adapt mediawiki history for MCR - https://phabricator.wikimedia.org/T238615 (Ladsgroup)
[17:26:14] Analytics-Radar, Data-Engineering-Icebox, Data-Persistence (work done), Platform Engineering Roadmap Decision Making, and 7 others: Remove revision_comment_temp and revision_actor_temp - https://phabricator.wikimedia.org/T215466 (Ladsgroup)
[17:26:25] Analytics, Data-Engineering-Icebox, Multi-Content-Revisions (Tech Debt): Adapt mediawiki history for MCR - https://phabricator.wikimedia.org/T238615 (Ladsgroup) This is not really a blocker 
of {T215466}, removed as parent.
[17:26:48] Analytics-Radar, Data-Engineering-Icebox, Data-Persistence (work done), Platform Engineering Roadmap Decision Making, and 7 others: Remove revision_comment_temp and revision_actor_temp - https://phabricator.wikimedia.org/T215466 (Ladsgroup)
[17:27:28] Data-Engineering, Data-Platform-SRE: Bring stat1009 into service - https://phabricator.wikimedia.org/T336036 (Ottomata) Scap deployment of analytics/refinery to stat1009 failed for me today due to: ` “git: ‘fat’ is not a git command. ` Looks like it should be installed by puppet standard_packages class, but...
[17:33:46] Data-Engineering, Data-Platform-SRE: Bring stat1009 into service - https://phabricator.wikimedia.org/T336036 (bking) @Ottomata we ran into this problem when deploying WDQS hosts on Bullseye. The default Bullseye config removes all Puppet 2 config. [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/9...
[17:38:29] Data-Engineering, Data-Platform-SRE: Bring stat1009 into service - https://phabricator.wikimedia.org/T336036 (Ottomata) Kay. @BTullis @Stevemunene FYI I removed git-fat package from stat1009, so the next time a deploy happens this will bite us again, until we do as @bking suggests.
[18:16:26] Analytics, Data-Engineering-Icebox: Make it easy to debug eventlogging instrumentation, add ability to send client canary events. - https://phabricator.wikimedia.org/T253239 (Ottomata) Open→Declined Declining, not because it's a bad idea, but because this is unlikely to be worked on ever.
[18:19:38] Analytics, Data-Engineering-Planning, Event-Platform Value Stream: Enable canary events for all streams - https://phabricator.wikimedia.org/T266798 (Ottomata)
[18:19:41] Analytics, Data-Engineering-Planning, Event-Platform Value Stream: Refine event pipeline at this time refines data in hourly partitions without knowing if the partition is complete - https://phabricator.wikimedia.org/T252585 (Ottomata) Open→Resolved a:Ottomata I'm going to resolve this t...
[18:20:29] Analytics, Analytics-Kanban, Data-Engineering, Event-Platform Value Stream, and 3 others: Automate ingestion and refinement into Hive of event data from Kafka using stream configs and canary/heartbeat events - https://phabricator.wikimedia.org/T251609 (Ottomata)
[18:30:55] Data-Engineering, Data-Platform-SRE: Bring stat1009 into service - https://phabricator.wikimedia.org/T336036 (Stevemunene) >>! In T336036#8928560, @Ottomata wrote: > Kay. > > @BTullis @Stevemunene FYI I removed git-fat package from stat1009, so the next time a deploy happens this will bite us again, un...
[18:51:32] I'm looking into the free disk space on an-launcher1002 alert.
[18:52:11] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (CodeReviewBot) gmodena opened https://gitlab.wiki...
[18:52:13] (DiskSpace) firing: (2) Disk space an-launcher1002:9100:/ 5.582% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-launcher1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[18:52:21] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (CodeReviewBot)
[18:54:10] Data-Engineering, DBA: Clean up clouddb1021 - https://phabricator.wikimedia.org/T337961 (Ladsgroup) ruwikinews' externallinks: ` 218G externallinks.ibd ` It needs to be compressed. I'll do a couple more tables too
[18:54:18] https://phabricator.wikimedia.org/T339002
[19:03:37] Data-Engineering-Planning, Data Pipelines: Review Superset permissions and assign roles as appropriate - https://phabricator.wikimedia.org/T328457 (spatton) Hi, could I (username //spatton//) please get access to the `sql_lab` role, too? Thanks!
[19:03:42] !log freeing up space in /srv on an-launcher1002 with `btullis@an-launcher1002:/srv/airflow-analytics/logs/scheduler$ find -maxdepth 1 -type d -mtime +15 -print0 | xargs -0 sudo rm -rf` for T339002
[19:03:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:03:45] T339002: The /srv volume is full on an-launcher1002 - https://phabricator.wikimedia.org/T339002
[19:04:52] Data-Engineering-Planning, Data Pipelines: Review Superset permissions and assign roles as appropriate - https://phabricator.wikimedia.org/T328457 (BTullis) @spatton - That's done for you now.
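The cleanup logged at 19:03:42 selects top-level directories whose modification time is more than 15 whole days old (`find -maxdepth 1 -type d -mtime +15`) and removes them. Below is a minimal dry-run sketch of the same selection in Python, useful for previewing what would be deleted before running anything destructive. The function name `stale_dirs` is hypothetical, and the sketch only lists candidates; it deliberately does not delete. Unlike `find -maxdepth 1`, it never returns the starting directory itself.

```python
import os
import time
from pathlib import Path


def stale_dirs(root, days, now=None):
    """List immediate subdirectories of `root` older than `days` whole days.

    Mirrors the `-mtime +N` rule of find: the age is truncated to whole
    24-hour periods, and a directory matches only if that count exceeds N.
    """
    now = time.time() if now is None else now
    stale = []
    for entry in os.scandir(root):
        # Skip files and symlinks; only real directories are candidates.
        if not entry.is_dir(follow_symlinks=False):
            continue
        mtime = entry.stat(follow_symlinks=False).st_mtime
        age_days = int((now - mtime) // 86400)
        if age_days > days:
            stale.append(Path(entry.path))
    return sorted(stale)


if __name__ == "__main__":
    # Dry run: print candidates only; nothing is removed.
    for path in stale_dirs(".", 15):
        print(path)
```

Reviewing the printed list first is a cheap safeguard, since `rm -rf` on the wrong directory cannot be undone.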
[19:06:17] Data-Engineering-Planning, DBA: Move Mediawiki QueryPages computation to Hadoop - https://phabricator.wikimedia.org/T309738 (Ladsgroup) Top priorities: ` '+enwiki' => [ 'Lonelypages' => 'monthly', 'Mostcategories' => 'monthly', 'Mostlinkedtemplates'...
[19:07:13] (DiskSpace) firing: (2) Disk space an-launcher1002:9100:/ 3.256% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-launcher1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[19:16:00] (PS1) TChin: Skip deterministic types tests for legacy schemas [schemas/event/primary] - https://gerrit.wikimedia.org/r/929782 (https://phabricator.wikimedia.org/T338228)
[19:21:50] (CR) Ottomata: "Nice." [schemas/event/primary] - https://gerrit.wikimedia.org/r/929782 (https://phabricator.wikimedia.org/T338228) (owner: TChin)
[19:22:41] (CR) Ottomata: Skip deterministic types tests for legacy schemas (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/929782 (https://phabricator.wikimedia.org/T338228) (owner: TChin)
[19:23:57] (CR) Ottomata: Skip deterministic types tests for legacy schemas (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/929782 (https://phabricator.wikimedia.org/T338228) (owner: TChin)
[19:25:49] RECOVERY - Disk space on an-launcher1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-launcher1002&var-datasource=eqiad+prometheus/ops
[19:27:13] (DiskSpace) resolved: Disk space an-launcher1002:9100:/ 4.904% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-launcher1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[19:27:53] !log restarting the hive-server2 and hive-metastore services on 
an-coord1001 [19:27:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:30:22] (CR) TChin: Skip deterministic types tests for legacy schemas (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/929782 (https://phabricator.wikimedia.org/T338228) (owner: TChin) [19:34:43] (PS2) TChin: Skip deterministic types tests for legacy schemas [schemas/event/primary] - https://gerrit.wikimedia.org/r/929782 (https://phabricator.wikimedia.org/T338228) [19:35:04] (CR) CI reject: [V: -1] Skip deterministic types tests for legacy schemas [schemas/event/primary] - https://gerrit.wikimedia.org/r/929782 (https://phabricator.wikimedia.org/T338228) (owner: TChin) [19:47:29] Data-Engineering, Advanced-Search, All-and-every-Wikisource, ArticlePlaceholder, and 64 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (SBisson) [19:47:34] (PS1) Gmodena: page_change: add a flag for bad revision data. [schemas/event/primary] - https://gerrit.wikimedia.org/r/929786 [19:47:58] (CR) CI reject: [V: -1] page_change: add a flag for bad revision data. [schemas/event/primary] - https://gerrit.wikimedia.org/r/929786 (owner: Gmodena) [19:51:45] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:54:20] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B): Improve Event Platform and MediaWiki Event Enrichment wikitech documentation - https://phabricator.wikimedia.org/T329629 (Ottomata) - Added a [[ https://wikitech.wikimedia.org/wiki/Event_Platform/Flaws | Flaws ]] page - Added a page de... 
[20:05:03] Krinkle: o/ looking for docs on how to get folks added to NPM wikimedia org so they can publish jsonschema-tools releases [20:05:22] is there a documented process somewhere? I know you helped me with this before... [20:12:20] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (gmodena) > I'm picking up this phab up in sprint... [20:12:50] (PS2) Gmodena: page_change: add a flag for bad revision data. [schemas/event/primary] - https://gerrit.wikimedia.org/r/929786 (https://phabricator.wikimedia.org/T309699) [20:13:16] (CR) CI reject: [V: -1] page_change: add a flag for bad revision data. [schemas/event/primary] - https://gerrit.wikimedia.org/r/929786 (https://phabricator.wikimedia.org/T309699) (owner: Gmodena) [20:14:59] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (CodeReviewBot) gmodena updated https://gitlab.wik... [20:16:05] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (Ottomata) Nice! Small correction: > Content sup... [20:16:49] (CR) Ottomata: "Right, it will fail until we release version 0.13.0 on npm. After that I think it should work?" 
[schemas/event/primary] - https://gerrit.wikimedia.org/r/929782 (https://phabricator.wikimedia.org/T338228) (owner: TChin) [20:17:03] (CR) Ottomata: Skip deterministic types tests for legacy schemas (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/929782 (https://phabricator.wikimedia.org/T338228) (owner: TChin) [20:21:50] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: jsonschema-tools deterministic schema test should fail if a object field does not have schema - https://phabricator.wikimedia.org/T338228 (Ottomata) [20:57:05] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B): Improve Event Platform and MediaWiki Event Enrichment wikitech documentation - https://phabricator.wikimedia.org/T329629 (Ottomata) Added [[ https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines#MediaWiki_state_fragment... [21:03:38] RECOVERY - Hadoop NodeManager on analytics1060 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:54:55] ottomata: https://wikitech.wikimedia.org/wiki/Npm_registry [21:56:09] ottomata: publishing rights are per-package in npm, not like git or phab [21:57:12] So as package owner you can do that from the command line using something like "npm owners add ..." from the directory where the package's package.json is [21:57:33] You can run npm whoami to confirm you're logged in if that doesn't work [21:57:45] Let me know if you need help :) [21:58:49] Krinkle: Roan is working on the problem now. There is a slack thread about this now. I passed on your link as none of us had tracked that bit down yet. 
:) [23:51:44] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed