[01:15:53] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:17:44] (SystemdUnitFailed) resolved: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:11:14] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 5.922% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[06:11:15] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 5.617% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[09:01:40] Data-Platform-SRE (23/24 Q2 Milestone 1): Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (Stevemunene) Druid100[4-6] are now fully drained and we can proceed with the next steps on the decommission process {F41594054}
[09:48:44] claime: would you have some time in the coming day(s) to pair on redeploying the dse k8s ingress LVS service? We've deployed a dummy service whose deployment ensures that port 30443 is open on each worker host, so the deployment should work this time. Thanks!
[10:09:07] brouberol: sure
[10:11:29] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 5.575% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[10:31:43] Data-Platform-SRE: Improve observability for non-k8s Envoy proxies (wdqs) - https://phabricator.wikimedia.org/T353003 (Gehel) p:Triage→Medium
[11:13:18] Data-Engineering, CirrusSearch, Discovery-Search, Image-Suggestions, Structured-Data-Backlog: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2023-11-27 - https://phabricator.wikimedia.org/T353134 (dcausse)
[12:07:11] Data-Platform-SRE (23/24 Q2 Milestone 1), Patch-For-Review: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639 (brouberol) ` brouberol@cumin1001:~$ host k8s-ingress-dse.svc.eqiad.wmnet k8s-ingress-dse.svc.eqiad.wmnet has address 10.2.2.91 brouberol@cumin1001:~$ hos...
[12:08:44] Data-Platform-SRE (23/24 Q2 Milestone 1), Patch-For-Review: Create a superset container image using the PipelineLib framework - https://phabricator.wikimedia.org/T352165 (CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/superset/-/merge_requests/2 Add a publish stage to...
[12:13:10] Data-Platform-SRE (23/24 Q2 Milestone 1), Patch-For-Review: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639 (brouberol) ` brouberol@dns1004:~$ host spark-history.svc.eqiad.wmnet spark-history.svc.eqiad.wmnet is an alias for k8s-ingress-dse.svc.eqiad.wmnet. k8s-i...
[12:41:08] Data-Platform-SRE (23/24 Q2 Milestone 1), Patch-For-Review: Create a superset container image using the PipelineLib framework - https://phabricator.wikimedia.org/T352165 (CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/superset/-/merge_requests/2 Add a publish stage to...
[12:50:13] Data-Platform-SRE (23/24 Q2 Milestone 1), Patch-For-Review: Create a superset container image using the PipelineLib framework - https://phabricator.wikimedia.org/T352165 (CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/superset/-/merge_requests/3 Fix the location of the...
[13:10:15] Data-Platform-SRE (23/24 Q2 Milestone 1): [airflow] Inserting task notes is not working since upgrade to version 2.7.3 - https://phabricator.wikimedia.org/T352534 (BTullis) I have done some investigation regarding the other authentication methods that are available to Airflow. In general, this is very simila...
[13:15:09] btullis: just a heads-up, I don't know if you're aware but the flink-kubernetes-operator image has failed building since 20231106
[13:16:51] Data-Platform-SRE (23/24 Q2 Milestone 1), Patch-For-Review: Create a superset container image using the PipelineLib framework - https://phabricator.wikimedia.org/T352165 (CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/superset/-/merge_requests/3 Fix the location of the...
[13:49:48] claime: OK, thanks. I haven't been too involved in that image, but I will give it a look.
[13:56:32] Data-Platform-SRE (23/24 Q2 Milestone 1): [airflow] Inserting task notes is not working since upgrade to version 2.7.3 - https://phabricator.wikimedia.org/T352534 (Ottomata) > Create a simple admin:admin user for each airflow instance I think this makes sense until we have something better. Our instances a...
[14:08:32] Data-Engineering, Observability-Metrics: [Data Quality] Sending Apache Spark metrics to PushGateway - https://phabricator.wikimedia.org/T297231 (Ottomata) Yes, but > We already have the JSON API plugin installed in Grafana... This I did not know about! This is very cool!
[14:11:30] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 5.491% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[14:23:28] Data-Engineering (Sprint 6): [Data Quality] Metrics Alerting - https://phabricator.wikimedia.org/T352685 (Ottomata) > Is this a matter of semantics or are there some implicit limitations of the backend? IIUC, each alert is manually defined based on a prometheus query. So, say dataset A has partition hours...
[14:26:11] Data-Engineering (Sprint 6): [Data Quality] Metrics Alerting - https://phabricator.wikimedia.org/T352685 (Ottomata) I could be very wrong though! We should check with observability.
[14:43:41] !log roll-restarting the aqs (nodejs based) services with https://gerrit.wikimedia.org/r/c/operations/puppet/+/982097
[14:43:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:45:44] Data-Engineering (Sprint 6): [Iceberg Migration] Migrate aqs hourly tables to Iceberg - https://phabricator.wikimedia.org/T352669 (tchin) a:tchin
[15:00:31] Data-Engineering, CirrusSearch, Discovery-Search, Image-Suggestions, Structured-Data-Backlog: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2023-11-27 - https://phabricator.wikimedia.org/T353134 (Milimetric) > wmf_r...
[15:06:44] Data-Engineering, CirrusSearch, Discovery-Search, Image-Suggestions, Structured-Data-Backlog: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2023-11-27 - https://phabricator.wikimedia.org/T353134 (dcausse)
[15:06:58] Data-Engineering, CirrusSearch, Discovery-Search, Image-Suggestions, Structured-Data-Backlog: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2023-11-27 - https://phabricator.wikimedia.org/T353134 (dcausse) @Milimetri...
[15:26:58] Data-Platform-SRE (23/24 Q2 Milestone 1): [airflow] Inserting task notes is not working since upgrade to version 2.7.3 - https://phabricator.wikimedia.org/T352534 (xcollazo) >Report it as a bug upstream and point out that None should be a valid value for user_id To me, this feels like a regression. A feature...
[15:33:18] Data-Engineering, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cephosd2002.codfw.wmnet with OS bullseye
[15:51:37] Data-Engineering, CirrusSearch, Discovery-Search (Current work): [Search Update Pipeline] Source streams for private wikis - https://phabricator.wikimedia.org/T346046 (Ottomata) Ideally, we should think about private and public versions of every stream. Private streams still have all events, Public...
[15:58:34] Data-Engineering (Sprint 6), Event-Platform, Patch-For-Review: [Event Platform] change propagation should discard canary events - https://phabricator.wikimedia.org/T351247 (elukey) @Ottomata let's send an email to ops@ to alert about this change, it is not big but it is the first one in a while (in c...
[16:03:30] Data-Engineering (Sprint 6), Event-Platform, Patch-For-Review: [Event Platform] change propagation should discard canary events - https://phabricator.wikimedia.org/T351247 (Ottomata) Okay, I'll send one along with the enabling of canary events in general. TY
[16:14:00] Data-Platform-SRE, Discovery-Search (Current work): Create SLI / SLO on Search update lag - https://phabricator.wikimedia.org/T328330 (Gehel)
[16:14:20] Data-Platform-SRE, Discovery-Search (Current work): Create SLI / SLO on Search update lag - https://phabricator.wikimedia.org/T328330 (Gehel) p:Triage→High
[16:23:07] Data-Platform-SRE (23/24 Q2 Milestone 1), Patch-For-Review: Create a superset container image using the PipelineLib framework - https://phabricator.wikimedia.org/T352165 (CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/superset/-/merge_requests/4 Set the npm proxy on th...
[16:23:11] Data-Engineering, CirrusSearch, Image-Suggestions, Structured-Data-Backlog, Discovery-Search (Current work): Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2023-11-27 - https://phabricator.wikimedia.org/T353134 (Gehel)
[16:24:28] Data-Platform-SRE: ProbeDown - https://phabricator.wikimedia.org/T353065 (Gehel)
[16:26:38] Data-Engineering, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cephosd2002.codfw.wmnet with OS bullseye completed: - cephosd2002 (...
[16:27:03] Data-Engineering, tech-decision-forum, Event-Platform: MediaWiki Event Carried State Transfer - Problem Statement - https://phabricator.wikimedia.org/T291120 (Ottomata)
[16:28:02] Analytics, Data-Engineering (Sprint 6), Event-Platform, Patch-For-Review, User-notice: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (Ottomata) This has been done!
[16:38:04] Data-Platform-SRE (23/24 Q2 Milestone 1): [airflow] Inserting task notes is not working since upgrade to version 2.7.3 - https://phabricator.wikimedia.org/T352534 (BTullis) >>! In T352534#9396561, @xcollazo wrote: >>Report it as a bug upstream and point out that None should be a valid value for user_id > To...
[16:38:36] Data-Platform-SRE (23/24 Q2 Milestone 1), Patch-For-Review: Create a superset container image using the PipelineLib framework - https://phabricator.wikimedia.org/T352165 (CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/superset/-/merge_requests/4 Set the npm proxy on th...
[17:20:38] Data-Engineering (Sprint 6), Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[17:20:40] Data-Platform-SRE (23/24 Q2 Milestone 1), Patch-For-Review: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639 (brouberol) Open→Resolved
[17:20:43] Data-Engineering, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (Jhancock.wm)
[17:21:23] Data-Engineering, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (Jhancock.wm) Open→Resolved @BTullis this is completed!
[17:43:52] Data-Engineering, CirrusSearch, Image-Suggestions, Structured-Data-Backlog, Discovery-Search (Current work): Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2023-11-27 - https://phabricator.wikimedia.org/T353134 (odimi...
[17:44:20] Data-Engineering (Sprint 6), Data Pipelines, Event-Platform, MW-1.41-notes (1.41.0-wmf.28; 2023-09-26): [Event Platform] eventgate-wikimedia occasionally fails to produce events due to stream config fetch errors - https://phabricator.wikimedia.org/T326002 (Ottomata) Created data loss report at ht...
[18:11:30] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 5.418% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[19:14:47] (PS1) Milimetric: Update deletion script with linktarget move [analytics/refinery] - https://gerrit.wikimedia.org/r/982149
[19:15:08] (CR) Milimetric: [V: +2 C: +2] Update deletion script with linktarget move [analytics/refinery] - https://gerrit.wikimedia.org/r/982149 (owner: Milimetric)
[21:16:14] Data-Engineering, Browser-Support-Microsoft-Edge, Event-Platform, Wikimedia-Performance-recommendation: Problem with delay caused by intake-analytics.wikimedia.org - https://phabricator.wikimedia.org/T295427 (BTullis) a:BTullis→None
[21:24:26] Data-Platform-SRE (23/24 Q2 Milestone 1), Patch-For-Review: Create a superset container image using the PipelineLib framework - https://phabricator.wikimedia.org/T352165 (CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/superset/-/merge_requests/6 Use a node builder for...
[21:29:54] Data-Platform-SRE (23/24 Q2 Milestone 1), Patch-For-Review: Create a superset container image using the PipelineLib framework - https://phabricator.wikimedia.org/T352165 (CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/superset/-/merge_requests/6 Use a node builder for...
[21:34:58] Data-Engineering, MediaWiki-extensions-EventLogging, Event-Platform, Patch-For-Review: Decommission EventLogging backend components by migrating to MEP - https://phabricator.wikimedia.org/T238230 (Ottomata)
[22:11:30] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 5.344% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[22:46:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ...
[22:46:27] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning
[23:22:56] Data-Platform-SRE (23/24 Q2 Milestone 1), Patch-For-Review: Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355 (bking) There is a way to test this manually from the icinga host; see Daniel Zahn's comment [[ https://gerrit.wikimedia.org/r/c/operations/pupp...
[23:34:09] Data-Platform-SRE (23/24 Q2 Milestone 1), Patch-For-Review: Create a superset container image using the PipelineLib framework - https://phabricator.wikimedia.org/T352165 (CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/superset/-/merge_requests/7 Use a plain node builder
[23:35:12] Data-Platform-SRE (23/24 Q2 Milestone 1), Patch-For-Review: Create a superset container image using the PipelineLib framework - https://phabricator.wikimedia.org/T352165 (CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/superset/-/merge_requests/7 Use a plain node builder
[23:45:16] Data-Engineering-Radar, MediaWiki-extensions-EventLogging, Metrics Platform Backlog, Data Products (Data Products Sprint 05), Technical-Debt: Non-deterministic unit test "streamInSample() - session sampling resets" - https://phabricator.wikimedia.org/T304379 (mpopov) From @phuedx on Slack: >...
[23:51:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) resolved: ...
[23:51:27] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning
[23:53:27] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) firing: ...
[23:53:27] The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning