[00:02:44] (SystemdUnitFailed) firing: (10) prometheus-node-textfile-prometheus-check-certificate-expiry.service Failed on an-airflow1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:03:17] PROBLEM - Check systemd state on an-airflow1004 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-node-textfile-prometheus-check-certificate-expiry.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:03:29] PROBLEM - Check systemd state on an-airflow1006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-node-textfile-prometheus-check-certificate-expiry.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:03:35] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-node-textfile-prometheus-check-certificate-expiry.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:04:03] PROBLEM - Check systemd state on an-airflow1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-node-textfile-prometheus-check-certificate-expiry.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:04:09] PROBLEM - Check systemd state on an-airflow1005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-node-textfile-prometheus-check-certificate-expiry.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:12:01] (SystemdUnitCrashLoop) firing: (3) crashloop on an-airflow1007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [03:32:44] (SystemdUnitFailed) firing: (11) prometheus-node-textfile-prometheus-check-certificate-expiry.service Failed on an-airflow1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:55:58] 10Data-Engineering, 10Web-Team-Backlog (Needs Prioritization (Tech)): Deal with minified scripts in JS error logging - https://phabricator.wikimedia.org/T520 (10tstarling) [05:12:01] (SystemdUnitCrashLoop) firing: (3) crashloop on an-airflow1007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [06:02:44] I'll check out why these systemd alerts fired [06:10:14] the issue is a difference of version with the cryptography python package causing the metrics exporter to fail with older versions. I tested the script on an-test-client1002, which is the _only_ node with a more recent version [06:13:29] RECOVERY - Check systemd state on an-airflow1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:13:42] ^ I tested a manual change, which worked. PR incoming [06:17:44] (SystemdUnitFailed) firing: (11) prometheus-node-textfile-prometheus-check-certificate-expiry.service Failed on an-airflow1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:18:41] https://gerrit.wikimedia.org/r/c/operations/puppet/+/967333 to the rescue [08:14:43] I'm going to decommission kafka-jumbo100[1-6] this morning [08:15:13] gogogo [08:18:19] is there anything I should mute or do before doing so? [08:21:34] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Follow up on rdf-streaming-updater failure 2023-10-17 - https://phabricator.wikimedia.org/T349147 (10Gehel) a:03bking [08:25:57] I can't think of anything, the cookbook already sets downtime. 
I think alertmanager should be ok, but might want a quick scan of the alerts repo, just to be sure. [08:38:11] ah, we're back to the issue of "I don't yet have access to the pws passwords". btullis: can you enter the management password in my screen, on cumin1001? (3969281.pts-13.cumin1001) [08:39:06] Ah yes, I forgot about that. [08:39:17] Will do. [08:40:42] Done. [08:43:59] thanks! [08:59:20] I aborted the decommission cookbook as we still had matches for the 6 broker IPs in `deployment-charts`. This is fixed in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/967400 [09:11:10] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) I have announced a maintenance window for **Tuesday 24th Oct at 09:30 UTC** when I will carry out this work. [09:11:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:11:52] RECOVERY - Check systemd state on an-airflow1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:12:14] RECOVERY - Check systemd state on an-airflow1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:12:18] RECOVERY - Check systemd state on an-airflow1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:12:44] (SystemdUnitFailed) firing: (10) prometheus-node-textfile-prometheus-check-certificate-expiry.service Failed on an-airflow1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:13:21] btullis: habemus x509 expiry metrics 
https://thanos.wikimedia.org/graph?g0.expr=x509_cert_expiry&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D [09:14:58] Nice. Excellent work. [09:15:52] however, aren't we supposed to get metrics from all worker hosts? [09:16:23] I might have deployed the timer to a smaller subset of hosts than necessary [09:16:32] Hi folks, [09:16:32] elukey and I are trying to access the `https://wikimedia.org/api/rest_v1/metrics/pageviews` endpoint on k8s using the `mw-api-int-async-ro` envoy listener and url `http://localhost:6500/api/rest_v1/metrics/pageviews`. [09:16:32] Is `mw-api-int-async-ro` the right listener to access this endpoint or should we use `aqs` instead? [09:16:46] (SystemdUnitCrashLoop) firing: (3) crashloop on an-airflow1007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [09:16:49] No, I don't think so. Skein itself is only installed on airflow hosts, which submit jobs to the hadoop cluster(s). [09:17:14] 👍 [09:17:44] (SystemdUnitFailed) firing: (10) prometheus-node-textfile-prometheus-check-certificate-expiry.service Failed on an-airflow1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:41:51] btullis: I've sent out a tiny PR (https://gerrit.wikimedia.org/r/c/operations/puppet/+/967404) that should add the certificate path to https://grafana-rw.wikimedia.org/d/980N6H7Iz/skein-certificate-expiry?orgId=1, to help us know from where to run the skein cert renewal command, if push comes to shove [09:51:19] kevinbazira: My gut feeling is that it should be the `aqs` listener, but I'm not 100% certain. [09:52:27] btullis: thanks! 
let's try `aqs` [09:53:35] kevinbazira: OK, if you get it working, maybe you could add an example of how to use aqs under this one: https://wikitech.wikimedia.org/wiki/Envoy#Example_(calling_mw-api) [09:54:03] Let us know. [09:54:31] btullis: ok, we'll let you know how it goes. [10:32:44] (SystemdUnitFailed) firing: (7) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:34:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:45:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:47:44] (SystemdUnitFailed) firing: (7) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:21:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [11:51:24] btullis I have a first draft of an alert on the expiry date of our skein certificates: https://gerrit.wikimedia.org/r/c/operations/alerts/+/967409. I took inspiration from other alerts, but I'm not sure of the team name (data-engineering vs sre) and such. Feel free to nitpick! 
[12:03:01] 10Data-Engineering, 10Tool-Pageviews, 10Data Products (Sprint 02), 10Patch-For-Review: Mediarequests returning "file not found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (10Sfaci) a:05SGupta-WMF→03EChukwukere-WMF [12:32:15] 10Data-Engineering, 10Tool-Pageviews, 10Data Products (Sprint 02), 10Patch-For-Review: Mediarequests returning "file not found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (10Lokal_Profil) >>! In T347899#9258560, @Sfaci wrote: > Just wondering, for example, why this i... [12:51:17] 10Data-Engineering, 10Tool-Pageviews, 10Data Products (Sprint 02), 10Patch-For-Review: Mediarequests returning "file not found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (10Sfaci) >>! In T347899#9268234, @Lokal_Profil wrote: >>>! In T347899#9258560, @Sfaci wrote: >>... [13:06:28] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [13:16:46] (SystemdUnitCrashLoop) firing: (3) crashloop on an-airflow1007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [13:37:32] 10Data-Engineering, 10CX-cxserver, 10Citoid, 10Content-Transform-Team-WIP, and 8 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (10Jdforrester-WMF) [13:37:56] 10Data-Engineering, 10CX-cxserver, 10Citoid, 10Content-Transform-Team-WIP, and 8 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (10Jdforrester-WMF) [13:59:15] 10Data-Platform-SRE, 10Wikidata, 
10Wikidata-Query-Service: Prepare new WDQS hosts for graph splitting - https://phabricator.wikimedia.org/T347505 (10bking) 05Open→03Resolved Work is complete...resolving. [14:47:59] (SystemdUnitFailed) firing: (6) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:29:07] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10BTullis) We discovered a small problem in testing: The `spark.shuffle.service.name` configuration option was only introduced in vers... [15:37:16] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: [Event Platform] Define Flink k8s operator SLO - https://phabricator.wikimedia.org/T345914 (10Ahoelzl) [15:40:57] 10Data-Platform-SRE: Reduce the prometheus-metricsfetcher cli complexity - https://phabricator.wikimedia.org/T349393 (10brouberol) [15:41:43] 10Data-Platform-SRE: Reduce the prometheus-metricsfetcher cli complexity - https://phabricator.wikimedia.org/T349393 (10CodeReviewBot) brouberol updated https://gitlab.wikimedia.org/repos/sre/kafkakit-prometheus-metricsfetcher/-/merge_requests/3 Integrate 2 new features from upstream [15:56:18] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: [Data Quality] [SPIKE] Can we identify indicators to inform an SLO for event emission and intake? 
- https://phabricator.wikimedia.org/T345195 (10Ahoelzl) [16:02:00] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: Follow up on rdf-streaming-updater failure 2023-10-17 - https://phabricator.wikimedia.org/T349147 (10CodeReviewBot) bking opened https://gitlab.wikimedia.org/repos/search-platform/flink-rdf-streaming-updater/-/merge_requests/13... [16:08:46] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data Engineering and Event Platform Team, 10Data Pipelines, 10Epic: Spark 3 Migration - https://phabricator.wikimedia.org/T309993 (10lbowmaker) 05Open→03Resolved [16:13:13] 10Quarry, 10Toolforge, 10cloud-services-team (FY2023/2024-Q1): Create db user for Quarry with readonly access to public ToolsDB databases - https://phabricator.wikimedia.org/T348407 (10fnegri) p:05Triage→03Medium There were no objections in the WMCS meeting, so we can proceed with creating the DNS record... [16:21:36] 10Data-Engineering, 10Data-Platform-SRE: Write a design document relating to superset on dse-k8s - https://phabricator.wikimedia.org/T349396 (10BTullis) [16:22:05] 10Data-Engineering, 10Data-Platform-SRE: Write a design document relating to superset on dse-k8s - https://phabricator.wikimedia.org/T349396 (10BTullis) a:03BTullis [16:23:53] 10Data-Engineering, 10Data-Platform-SRE: Migrate matomo to Debian bullseye (or bookworm) - https://phabricator.wikimedia.org/T349397 (10BTullis) [16:30:55] 10Data-Engineering, 10Data-Platform-SRE, 10Data Products: Migrate an-web1001 to Debian bullseye (or bookworm) - https://phabricator.wikimedia.org/T349398 (10BTullis) [16:35:46] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: Follow up on rdf-streaming-updater failure 2023-10-17 - https://phabricator.wikimedia.org/T349147 (10CodeReviewBot) bking merged https://gitlab.wikimedia.org/repos/search-platform/flink-rdf-streaming-updater/-/merge_requests/13... 
[16:36:29] 10Data-Engineering, 10Data-Platform-SRE: Migrate yarn.wikimedia.org to bullseye - https://phabricator.wikimedia.org/T349399 (10BTullis) [16:40:18] 10Data-Engineering, 10Data-Platform-SRE: Migrate hue.wikimedia.org to bullseye - https://phabricator.wikimedia.org/T349400 (10BTullis) [16:47:29] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: [Event Platform] mediawiki.page_content_change.v1 topic should be partitioned. - https://phabricator.wikimedia.org/T345806 (10Ahoelzl) [16:47:54] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: [Event Platform] mw-page-content-change-enrich should (re)produce kafka keys - https://phabricator.wikimedia.org/T338231 (10Ahoelzl) [16:48:15] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: [Event Platform] eventutilites-python: improve consistency guarantees of async process functions - https://phabricator.wikimedia.org/T347282 (10Ahoelzl) [16:50:17] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: [Event Platform] mw-page-content-change-enrich should not retry on badrevids if no replica lag - https://phabricator.wikimedia.org/T347884 (10Ahoelzl) [16:50:40] 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, and 2 others: [Event Platform] eventgate-wikimedia occasionally fails to produce events due to stream config errors - https://phabricator.wikimedia.org/T326002 (10Ahoelzl) [16:51:06] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: [Event Platform] Enable snappy compression for Flink Kafka producers - https://phabricator.wikimedia.org/T345805 (10Ahoelzl) [16:51:29] (03PS1) 10Ebernhardson: cirrussearch/update_pipeline/update remove required fields [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/967478 [16:51:45] 10Data-Engineering, 
10Data-Catalog, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: [Event Platform] Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (10Ahoelzl) [16:51:58] (03PS2) 10Ebernhardson: cirrussearch/update_pipeline/update remove required fields [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/967478 [16:54:13] 10Data-Engineering, 10Infrastructure-Foundations, 10SRE, 10netops, and 2 others: [Maintenance] Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (10Ahoelzl) [16:56:19] 10Data-Engineering, 10Data-Platform-SRE, 10SRE Observability, 10Data Engineering and Event Platform Team (Sprint 3): [Platform] Install a Prometheus connector for Presto, pointed at thanos-query - https://phabricator.wikimedia.org/T347430 (10Ahoelzl) [16:56:32] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 3): [Platform] Stop and remove oozie services - https://phabricator.wikimedia.org/T341893 (10Ahoelzl) [16:56:52] 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 3), 10Patch-For-Review: [Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10Ahoelzl) [16:57:10] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 3), 10Patch-For-Review: [Platform] Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (10Ahoelzl) [16:59:01] 10Data-Engineering, 10Data-Platform-SRE: Migrate hue.wikimedia.org to bullseye - https://phabricator.wikimedia.org/T349400 (10BTullis) [16:59:03] 10Data-Engineering, 10Data-Platform-SRE: Migrate yarn.wikimedia.org to bullseye - https://phabricator.wikimedia.org/T349399 (10BTullis) [16:59:05] 10Data-Engineering, 10Data-Platform-SRE: Migrate matomo to Debian bullseye (or bookworm) - https://phabricator.wikimedia.org/T349397 (10BTullis) [16:59:07] 10Data-Engineering, 10Data-Platform-SRE, 10Data Products: Migrate an-web1001 
to Debian bullseye (or bookworm) - https://phabricator.wikimedia.org/T349398 (10BTullis) [16:59:10] 10Data-Engineering, 10Discovery-Search, 10serviceops-radar, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: [Event Platform] [NEEDS GROOMING] Store Flink HA metadata in Zookeeper - https://phabricator.wikimedia.org/T331283 (10Ahoelzl) [16:59:31] 10Data-Engineering, 10Machine-Learning-Team, 10Wikimedia Enterprise, 10Data Engineering and Event Platform Team (Sprint 3), and 2 others: [Event Platform] Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (10Ah... [17:00:16] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: [EPIC] Flink Applications on Kubernetes - https://phabricator.wikimedia.org/T324578 (10lbowmaker) 05Open→03Resolved a:03lbowmaker [17:02:06] 10Data-Engineering, 10Tool-Pageviews, 10Data Products (Sprint 02), 10Patch-For-Review: Mediarequests returning "file not found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (10EChukwukere-WMF) Test status: //**QA PASS**// tested response and data ( compared with AQS 1... 
[17:07:13] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Data Pipelines: [Airflow Migration] Migrate 1+ reportupdater jobs - https://phabricator.wikimedia.org/T307540 (10Ahoelzl) [17:15:46] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 3): [Data Platform] Stop and remove oozie services - https://phabricator.wikimedia.org/T341893 (10Ahoelzl) [17:15:57] 10Data-Engineering, 10Data-Platform-SRE, 10SRE Observability, 10Data Engineering and Event Platform Team (Sprint 3): [Data Platform] Install a Prometheus connector for Presto, pointed at thanos-query - https://phabricator.wikimedia.org/T347430 (10Ahoelzl) [17:16:14] 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 3), 10Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10Ahoelzl) [17:16:32] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 3), 10Patch-For-Review: [Data Platform] Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (10Ahoelzl) [17:16:47] (SystemdUnitCrashLoop) firing: (3) crashloop on an-airflow1007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [17:29:43] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform, 10Patch-For-Review: [Event Platform] Enable canary events for all streams - https://phabricator.wikimedia.org/T266798 (10Ahoelzl) [17:29:48] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: [Event Platform] Declare webrequest as an Event Platform stream - https://phabricator.wikimedia.org/T314956 (10Ahoelzl) [17:30:04] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform, 10Patch-For-Review: [Event Platform] Move Spark JsonSchemaConverter out of analytics/refinery/source and into 
wikimedia-event-utilities - https://phabricator.wikimedia.org/T321854 (10Ahoelzl) [17:30:11] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform, 10Wikimedia-production-error: [Event Platform] Error: Call to a member function exists() on null (via EventBus PageChangeEventSerializer) - https://phabricator.wikimedia.org/T346355 (10Ahoelzl) [17:30:40] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10EventStreams, 10Event-Platform: [Event Platform] Event streams don't respect milliseconds UTC unix epoch timestamp in since parameter - https://phabricator.wikimedia.org/T345606 (10Ahoelzl) [17:30:49] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: [Event Platform] Add expiry info to mediawiki.page-restrictions-change stream - https://phabricator.wikimedia.org/T282057 (10Ahoelzl) [17:30:51] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: [Event Platform] Can we import metrics from logstash to promethues? 
- https://phabricator.wikimedia.org/T347484 (10Ahoelzl) [17:31:57] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Epic, 10Event-Platform: [Event Platform] Flink Operations - https://phabricator.wikimedia.org/T328561 (10Ahoelzl) [18:47:59] (SystemdUnitFailed) firing: (6) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:32:45] (SystemdUnitFailed) firing: (6) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:35:27] (03PS2) 10Milimetric: Update schema of mediawiki_wikitext_* [analytics/refinery] - 10https://gerrit.wikimedia.org/r/966914 (https://phabricator.wikimedia.org/T348767) [19:35:35] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Follow up on rdf-streaming-updater failure 2023-10-17 - https://phabricator.wikimedia.org/T349147 (10bking) Deployed `flink-1.16.1-rdf-0.3.136` release for WCQS and WDQS staging, via: ` python3 flink/flink-job.py \ --env staging \ --jo... 
[19:37:27] (03CR) 10Milimetric: Update schema of mediawiki_wikitext_* (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/966914 (https://phabricator.wikimedia.org/T348767) (owner: 10Milimetric) [19:38:52] (03CR) 10Sbisson: T343183 add "story share" event; add "user_is_anonymous" field and bump to version 1.1.0 (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/965846 (owner: 10Conniecc1) [19:50:35] (03CR) 10Xcollazo: [C: 03+1] Update schema of mediawiki_wikitext_* (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/966914 (https://phabricator.wikimedia.org/T348767) (owner: 10Milimetric) [19:54:43] (03PS3) 10Conniecc1: T343183 add "story share" event; add "user_is_anonymous" field and bump to version 1.1.0 [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/965846 [19:55:18] (03CR) 10CI reject: [V: 04-1] T343183 add "story share" event; add "user_is_anonymous" field and bump to version 1.1.0 [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/965846 (owner: 10Conniecc1) [19:57:26] (03CR) 10Conniecc1: [C: 03+2] T343183 add "story share" event; add "user_is_anonymous" field and bump to version 1.1.0 (032 comments) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/965846 (owner: 10Conniecc1) [20:07:50] (03PS4) 10Conniecc1: T343183 add "story share" event; add "user_is_anonymous" field and bump to version 1.1.0 [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/965846 [20:08:23] (03CR) 10CI reject: [V: 04-1] T343183 add "story share" event; add "user_is_anonymous" field and bump to version 1.1.0 [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/965846 (owner: 10Conniecc1) [20:11:21] (03PS5) 10Conniecc1: T343183 add "story share" event; add "user_is_anonymous" field and bump to version 1.1.0 [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/965846 [20:51:52] (03PS2) 10Bearloga: wikistories_contribution_event: add story_share event type 
[schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/965845 (https://phabricator.wikimedia.org/T343183) (owner: 10Conniecc1) [20:54:06] (03CR) 10Bearloga: [C: 03+2] "Made some minor formatting changes for consistency but otherwise looks good to me." [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/965845 (https://phabricator.wikimedia.org/T343183) (owner: 10Conniecc1) [20:54:38] (03Merged) 10jenkins-bot: wikistories_contribution_event: add story_share event type [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/965845 (https://phabricator.wikimedia.org/T343183) (owner: 10Conniecc1) [20:55:52] (03CR) 10Milimetric: [C: 03+1] "I played a little with spark tuning and that deduplication logic, but couldn't find any obvious wins. My tests on enwiki, rowiki, and sim" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/965792 (https://phabricator.wikimedia.org/T348767) (owner: 10Milimetric) [21:02:45] (SystemdUnitFailed) firing: (6) wmf_auto_restart_airflow-scheduler@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:04:28] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:15:36] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:17:01] (SystemdUnitCrashLoop) firing: (3) crashloop on an-airflow1007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [21:19:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: 
produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:30:52] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:32:45] (SystemdUnitFailed) firing: (6) wmf_auto_restart_airflow-scheduler@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed