[00:02:45] Data-Engineering, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (Papaul) @Jhancock.wm i send a patch to fix it. you can resume the install https://gerrit.wikimedia.org/r/c/operations/puppet/+/981413
[00:16:42] (SystemdUnitFailed) firing: hardsync-published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:18:17] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:08] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:42] (SystemdUnitFailed) resolved: hardsync-published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:41:02] Data-Engineering, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (Papaul)
[08:47:33] Data-Platform-SRE (23/24 Q2 Milestone 1), Discovery-Search (Current work): Load Wikidata split graphs into test servers - https://phabricator.wikimedia.org/T350465 (Gehel)
[08:47:53] * brouberol waves good morning
Data-Platform-SRE (23/24 Q2 Milestone 1), Discovery-Search (Current work): Load Wikidata split graphs into test servers - https://phabricator.wikimedia.org/T350465 (Gehel) a:RKemper
[09:06:01] Data-Platform-SRE: Validate the impact of a k8s upgrade on our Flink deployment - https://phabricator.wikimedia.org/T353045 (Gehel)
[09:06:21] Data-Platform-SRE: Validate the impact of a k8s upgrade on our Flink deployment - https://phabricator.wikimedia.org/T353045 (Gehel) p:Triage→Medium
[09:06:57] Data-Engineering, Data-Platform-SRE (23/24 Q2 Milestone 1), Discovery-Search (Current work), Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (Gehel) In progress→Resolved
[09:07:04] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, [DEPRECATED] wdwb-tech, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (Gehel)
[09:07:08] Data-Engineering, Data-Platform-SRE (23/24 Q2 Milestone 1), Discovery-Search (Current work), Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (Gehel) Closing this as we have all of what we need to move the search upd...
[09:11:08] Quarry: CSV files not being written in UTF-8 - https://phabricator.wikimedia.org/T353047 (Novem_Linguae)
[09:28:39] Data-Platform-SRE (23/24 Q2 Milestone 1), serviceops, Discovery-Search (Current work): Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503 (Gehel)
[09:29:36] Quarry: CSV files not being written in UTF-8 - https://phabricator.wikimedia.org/T353047 (SD0001) Open→Invalid The text/csv response does specify charset=utf-8. {F41574205} I couldn't reproduce this – the downloaded file looks alright to me: {F41574052} Please check if this is a problem with the lo...
[09:31:28] Quarry: CSV files not being written in UTF-8 - https://phabricator.wikimedia.org/T353047 (Novem_Linguae) Yeah I think you're right. It looks OK in Notepad++. Interesting! {F41574241}
[12:40:42] Data-Engineering (Sprint 6): [Data Quality] Metrics Alerting - https://phabricator.wikimedia.org/T352685 (gmodena) Thanks for the reminder @Ahoelzl . My plan was to ping observability folks in the review of this phab (to have something concrete to show them). > FWIW, Alert Manager won't work well for histor...
[12:41:21] Data-Engineering, Observability-Metrics: [Data Quality] Sending Apache Spark metrics to PushGateway - https://phabricator.wikimedia.org/T297231 (BTullis) Has anyone considered using the Spark History Server (T330176) for an application metrics store? We're very close to having the history server up and r...
[13:18:36] Data-Engineering (Sprint 6): [Data Quality] Metrics Alerting - https://phabricator.wikimedia.org/T352685 (BTullis) In case it helps, there may also be some cross-over with {T343234} which talks about the possibility of creating [[https://airflow.apache.org/docs/apache-airflow/2.7.3/howto/notifications.html|c...
[13:41:21] Data-Engineering (Sprint 6): [Data Quality] Metrics Alerting - https://phabricator.wikimedia.org/T352685 (BTullis) >> FWIW, Alert Manager won't work well for historical dataset based alerts. The best we can do in Alert Manager is 'there is a problem in the last