[01:30:22] Data-Engineering, Product-Analytics, Wmfdata-Python: Improve df_to_remarkup formatting for wmfdata-python - https://phabricator.wikimedia.org/T341589 (nshahquinn-wmf) p:Triage→Low
[03:32:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:34:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:46:01] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:47:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:15:42] (SystemdUnitFailed) firing: user-runtime-dir@499.service Failed on an-worker1145:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:16:57] PROBLEM - Check systemd state on an-worker1145 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:20:42] (SystemdUnitFailed) firing: (2) systemd-timedated.service Failed on an-worker1145:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:45:42] (SystemdUnitFailed) firing: (3) systemd-timedated.service Failed on an-worker1145:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:50:42] (SystemdUnitFailed) firing: (3) systemd-timedated.service Failed on an-worker1145:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:55:42] (SystemdUnitFailed) firing: (3) systemd-timedated.service Failed on an-worker1145:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:20:42] (SystemdUnitFailed) firing: (3) systemd-timedated.service Failed on an-worker1145:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:21:35] RECOVERY - Check systemd state on an-worker1145 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:25:42] (SystemdUnitFailed) firing: (3) systemd-timedated.service Failed on an-worker1145:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:25:57] PROBLEM - Check systemd state on an-worker1145 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:30:42] (SystemdUnitFailed) firing: (3) systemd-timedated.service Failed on an-worker1145:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:50:42] (SystemdUnitFailed) firing: (2) systemd-timedated.service Failed on an-worker1145:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:55:42] (SystemdUnitFailed) firing: (2) systemd-timedated.service Failed on an-worker1145:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:01:57] (SystemdUnitFailed) firing: (3) systemd-timedated.service Failed on an-worker1145:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:10:42] (SystemdUnitFailed) firing: (3) systemd-timedated.service Failed on an-worker1145:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:32:47] Data-Engineering, Movement-Insights, Product-Analytics, Research-Freezer: Investigate relation of UA deprecation to increase in automated traffic and reduction in unique devices - https://phabricator.wikimedia.org/T336715 (kostajh)
[06:34:03] Data-Engineering, Anti-Harassment, SRE, Traffic, and 2 others: Include User-Agent Client Hints in WebRequest logs - https://phabricator.wikimedia.org/T337947 (kostajh)
[06:40:42] (SystemdUnitFailed) firing: (3) systemd-timedated.service Failed on an-worker1145:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:45:42] (SystemdUnitFailed) firing: (3) systemd-timedated.service Failed on an-worker1145:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:50:42] (SystemdUnitFailed) firing: (4) systemd-timedated.service Failed on an-worker1145:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:55:42] (SystemdUnitFailed) firing: (4) systemd-timedated.service Failed on an-worker1145:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:01:55] checking out an-worker1145 --^
[07:05:42] (SystemdUnitFailed) firing: user@42437.service Failed on an-worker1145:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:15:42] (SystemdUnitFailed) firing: (2) user@42437.service Failed on an-worker1145:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:20:42] (SystemdUnitFailed) firing: (2) user@42437.service Failed on an-worker1145:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:28:12] Data-Platform-SRE: an-worker1145: soft lockup. - https://phabricator.wikimedia.org/T345413 (Stevemunene)
[07:31:05] Data-Platform-SRE: an-worker1145: soft lockup. - https://phabricator.wikimedia.org/T345413 (Stevemunene)
[07:32:55] Data-Platform-SRE, Research, WMDE-TechWish-Maintenance-2023: Publish dump scraper reports - https://phabricator.wikimedia.org/T341751 (awight)
[07:40:42] (SystemdUnitFailed) firing: (2) user@42437.service Failed on an-worker1145:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:43:33] !log powercycle an-worker1145.eqiad.wmnet host cpus soft lockup T345413
[07:43:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:43:36] T345413: an-worker1145: soft lockup. - https://phabricator.wikimedia.org/T345413
[07:44:55] PROBLEM - Host an-worker1145 is DOWN: PING CRITICAL - Packet loss = 100%
[07:45:42] (SystemdUnitFailed) resolved: (2) user@42437.service Failed on an-worker1145:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:46:58] (CR) Phuedx: [C: +1] Skip schema-deterministic-types for metrics_event schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/954097 (https://phabricator.wikimedia.org/T344511) (owner: TChin)
[07:47:15] RECOVERY - Host an-worker1145 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[07:47:45] RECOVERY - Check systemd state on an-worker1145 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:00:39] Data-Platform-SRE: an-worker1145: soft lockup. - https://phabricator.wikimedia.org/T345413 (Stevemunene) p:Triage→Medium The host seems to be back in service {F37646131} However, leaving this open in case it reappears within the day and for further conversations on the host.
[08:15:49] Data-Platform-SRE, Research, WMDE-TechWish-Maintenance-2023: Publish dump scraper reports - https://phabricator.wikimedia.org/T341751 (awight)
[08:28:04] Data-Platform-SRE, Research, WMDE-TechWish-Maintenance-2023: Publish dump scraper reports - https://phabricator.wikimedia.org/T341751 (awight)
[08:30:41] Data-Platform-SRE, Discovery-Search (Current work): Add Zookeeper config to 'rdf-streaming-updater' test service on DSE cluster - https://phabricator.wikimedia.org/T344614 (elukey) I don't see znodes on the new cluster: ` elukey@flink-zk1001:~$ sudo -u zookeeper /usr/share/zookeeper/bin/zkCli.sh SLF4J:...
[08:36:36] Data-Platform-SRE, Research, WMDE-TechWish-Maintenance-2023: Publish dump scraper reports - https://phabricator.wikimedia.org/T341751 (awight)
[08:36:42] (SystemdUnitFailed) firing: presto-server.service Failed on an-presto1002:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:39:28] Data-Platform-SRE, Discovery-Search (Current work): Add Zookeeper config to 'rdf-streaming-updater' test service on DSE cluster - https://phabricator.wikimedia.org/T344614 (elukey) Something odd: ` root@deploy1002:~# kubectl logs flink-app-wdqs-c7f6bff77-xlgs5 -n rdf-streaming-updater flink-main-contain...
[08:51:42] (SystemdUnitFailed) resolved: presto-server.service Failed on an-presto1002:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:59:00] Data-Platform-SRE, Data-Services: Queries to externallinks table fail following schema changes - https://phabricator.wikimedia.org/T344866 (Ladsgroup) The schema change is now run on every section and I ran the maintain-views afterwards. So this should be done now.
[08:59:29] Data-Platform-SRE, Data-Services: Queries to externallinks table fail following schema changes - https://phabricator.wikimedia.org/T344866 (Ladsgroup) Open→Resolved a:Ladsgroup
[10:09:50] Data-Engineering, Metrics Platform Backlog, Patch-For-Review: Make jsonschema-tools merge values of enums when merging allOf - https://phabricator.wikimedia.org/T345317 (phuedx) :point_up: Wrong task!
[12:19:42] (SystemdUnitFailed) firing: presto-server.service Failed on an-presto1002:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:24:42] (SystemdUnitFailed) resolved: presto-server.service Failed on an-presto1002:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:36:02] Data-Platform-SRE: Investigate an-presto1002 failures - https://phabricator.wikimedia.org/T344808 (Stevemunene) an-presto1002 was showing similar memory utilisation errors on 2023-09-01 with the latest one at time of writing seen here {F37646531} with a peak usage on 2023-09-01 12:16 UTC that led to the syst...
[12:36:47] --^ presto-server.service errors documented above on the task tracking this.
[12:49:32] Data-Platform-SRE: Bring Hadoop workers an-worker11[49-56] into service - https://phabricator.wikimedia.org/T343762 (Stevemunene)
[13:34:13] Data-Platform-SRE: Examine/refactor WDQS startup scripts - https://phabricator.wikimedia.org/T342361 (bking) Unfortunately, the patch had to be rolled back. The error we received was: `Aug 31 20:18:34 wdqs1004 wdqs-blazegraph[1142014]: Error: Could not find or load main class org.eclipse.jetty.runner.Runner`
[13:43:15] (PS7) Phuedx: Add analytics/metrics_platform/{app,web}/{click,view} schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833)
[14:07:41] Data-Engineering, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Set data permission on new snapshot generation (discovery.wikibase_rdf) - https://phabricator.wikimedia.org/T342416 (Gehel) Open→Resolved
[14:17:42] (SystemdUnitFailed) firing: presto-server.service Failed on an-presto1002:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:22:42] (SystemdUnitFailed) resolved: presto-server.service Failed on an-presto1002:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:34:51] (PS8) Phuedx: Add analytics/metrics_platform/{app,web}/{click,view} schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833)
[14:42:23] (CR) Clare Ming: "what's the rationale to go from 'action' to 'event_type'?" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: Phuedx)
[14:42:39] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk2001.codfw.wmnet with OS bo...
[15:04:24] (PS9) Phuedx: Add analytics/metrics_platform/{app,web}/{click,view} schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833)
[15:04:58] (CR) Phuedx: Add analytics/metrics_platform/{app,web}/{click,view} schemas (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: Phuedx)
[15:57:33] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk2001.codfw.wmnet with OS bookwo...
[16:03:52] (CR) Clare Ming: Add analytics/metrics_platform/{app,web}/{click,view} schemas (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: Phuedx)
[16:16:30] (CR) Phuedx: Add analytics/metrics_platform/{app,web}/{click,view} schemas (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: Phuedx)
[16:27:48] Data-Platform-SRE, Discovery-Search: rolling operation: Detect and remove failed index aliases - https://phabricator.wikimedia.org/T345449 (bking)
[16:28:14] Data-Platform-SRE, Discovery-Search: Rolling operation cookbook: Detect and remove failed index aliases - https://phabricator.wikimedia.org/T345449 (bking)
[16:42:28] (PS10) Phuedx: Add analytics/metrics_platform/{app,web}/{click,view} schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833)
[16:42:56] (CR) CI reject: [V: -1] Add analytics/metrics_platform/{app,web}/{click,view} schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: Phuedx)
[16:43:37] (PS11) Phuedx: Add analytics/metrics_platform/{app,web}/{click,view} schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833)
[17:16:58] Data-Engineering, Cassandra: Create keyspace and table for Knowledge Gaps - https://phabricator.wikimedia.org/T340494 (fkaelin) Following up on this, are there open questions/tasks regarding the creation/support of this dataset?
[17:34:09] Data-Engineering, Product-Analytics, Wmfdata-Python: Wmfdata's presto.run fails with Urllib3 v2 or higher - https://phabricator.wikimedia.org/T345309 (nshahquinn-wmf) Open→Resolved a:nshahquinn-wmf Wmfdata now pins Urllib3 below v2 ([PR 45](https://github.com/wikimedia/wmfdata-python/pull...
[21:15:00] Data-Platform-SRE: Service implementation for wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T345475 (RKemper)
[22:13:04] Data-Engineering, Movement-Insights, Product-Analytics, Research-Freezer: Investigate relation of UA deprecation to increase in automated traffic and reduction in unique devices - https://phabricator.wikimedia.org/T336715 (Mayakp.wiki) There are several questions that would be great to get more a...
[22:39:42] Data-Engineering, Product-Analytics, Wmfdata-Python: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (nshahquinn-wmf)
[22:40:17] Data-Engineering, Product-Analytics, Wmfdata-Python: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (nshahquinn-wmf)
[22:40:51] Data-Engineering, Product-Analytics, Wmfdata-Python: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (nshahquinn-wmf) @BTullis do you have any idea how to make the CNAME work here?
[22:41:50] Data-Engineering, Product-Analytics, Wmfdata-Python: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (nshahquinn-wmf) p:Triage→Medium This will lead to unexpected breakage and need an immediate patch at some when the coordinator ro...