[00:23:51] (PS1) Sharvaniharan: Android: New schema for image recommendations feature [schemas/event/secondary] - https://gerrit.wikimedia.org/r/940266
[00:25:26] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1153.eqiad.wmnet with OS bullseye completed: - an-worker1153 (*...
[00:25:50] (PS2) Sharvaniharan: Android: New schema for image recommendations feature [schemas/event/secondary] - https://gerrit.wikimedia.org/r/940266
[00:26:08] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (Jhancock.wm)
[00:26:38] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (Jhancock.wm) finished 53-56. should have time to finish the last 4 tomorrow afternoon
[01:33:41] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (Papaul) Thank you
[03:37:27] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[03:37:27] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[03:49:28] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:49:43] (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:52:16] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work), Patch-For-Review: Configure new WDQS servers in codfw (wdqs20[13-22]) - https://phabricator.wikimedia.org/T332314 (RKemper)
[03:57:25] Data-Platform-SRE: Decommission old WDQS servers - https://phabricator.wikimedia.org/T342035 (RKemper) a:RKemper With the new hosts in service, we can now begin decom'ing these hosts at our convenience.
[03:58:53] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work), Patch-For-Review: Configure new WDQS servers in codfw (wdqs20[13-22]) - https://phabricator.wikimedia.org/T332314 (RKemper) `wdqs202[1-2]` have been brought into service. With the merging of https://gerrit.w...
[04:00:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:04:43] (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:33:33] Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Data-Persistence: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (Marostegui) Open→Resolved MariaDB 10.6 has been installed; host is showing all green on Icinga, so closing this. Thanks!
[06:58:42] Data-Engineering, Data-Services, Growth-Team, PageTriage, cloud-services-team: Clean up pagetriage_log views - https://phabricator.wikimedia.org/T331844 (Novem_Linguae) Open→Resolved a:Novem_Linguae https://gerrit.wikimedia.org/r/c/operations/puppet/+/884454/ was merged on June 13...
[07:15:48] Data-Platform-SRE, DBA, Patch-For-Review: Migrate dbstore1005 to MariaDB 10.6 - https://phabricator.wikimedia.org/T334652 (Marostegui) Open→Resolved This host has been migrated to mariadb 10.6
[07:37:27] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[07:37:27] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[08:04:43] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@analytics_meta.service Failed on db1108:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:53:19] RECOVERY - mysqld processes on db1108 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[08:56:24] Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Data-Persistence: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6df61f58-13ae-431b-ba26-03ec11d8c678) set by btullis@cumin1001 fo...
[09:28:30] Data-Platform-SRE, SRE-OnFire, Wikidata, Wikidata-Query-Service, and 3 others: Review alerting around Wikidata Query Service update pipeline - https://phabricator.wikimedia.org/T336574 (Gehel) p:Triage→Medium
[10:26:44] (SystemdUnitCrashLoop) firing: confd-k8s.service crashloop on dse-k8s-ctrl1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[10:28:28] ^ jayme could this be related to your confd work? I think it's non-critical for us right now, but you may be interested.
[10:28:46] yep, that's me - sorry
[10:29:20] no probs. Happy to help if there's anything I can do.
[10:30:53] nah, I just disabled puppet too late. Won't do any harm though
[10:31:03] :+1
[11:31:44] (SystemdUnitCrashLoop) resolved: confd-k8s.service crashloop on dse-k8s-ctrl1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[11:37:27] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[11:37:27] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[11:45:38] Data-Platform-SRE: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (Stevemunene)
[12:04:43] (SystemdUnitFailed) firing: jupyter-dsaez-singleuser-conda-analytics.service Failed on stat1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:18:00] Data-Engineering, Unstewarded-production-error, Wikimedia-production-error: '.client_dt' should match format "date-time", '.event.pageNamespace' should be integer, '.event.skinVersion' should be integer - https://phabricator.wikimedia.org/T321329 (Aklapper) Open→Resolved `type:eventgate_valid...
[12:50:20] Data-Platform-SRE, Shared-Data-Infrastructure: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (Gehel) p:Triage→Medium
[12:50:42] Data-Platform-SRE: Bring stat1009 into service - https://phabricator.wikimedia.org/T336036 (Gehel) p:Triage→High
[12:53:43] Data-Platform-SRE, Shared-Data-Infrastructure: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (Gehel) Open→Resolved
[12:53:48] Data-Platform-SRE: Bring stat1009 into service - https://phabricator.wikimedia.org/T336036 (Gehel) Open→Resolved
[12:53:57] Data-Platform-SRE, Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (Gehel) Open→Resolved
[12:54:06] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work), Patch-For-Review: Configure new WDQS servers in codfw (wdqs20[13-22]) - https://phabricator.wikimedia.org/T332314 (Gehel) Open→Resolved
[12:54:21] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work), Patch-For-Review: Configure new WDQS servers in codfw (wdqs20[13-22]) - https://phabricator.wikimedia.org/T332314 (Gehel) p:Triage→High
[12:54:47] Data-Platform-SRE: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (Stevemunene) Hi @AndrewTavis_WMDE, this has been picked up and is in progress. Currently working on creating the WMDE airflow admin group for managing WMDE related analytics/data jobs and deploying airf...
[13:12:39] Data-Platform-SRE: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (AndrewTavis_WMDE) Great to hear, @Stevemunene! Thanks for the support :) Myself and @Manuel would be the admins per prior conversations we've had.
[13:13:55] Data-Platform-SRE: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (Manuel) What Andrew said! Please also add our engineering manager @karapayneWMDE to the group.
[13:45:55] Data-Platform-SRE, SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (bking) This is complete; moving to 'needs review'.
[13:54:36] Data-Platform-SRE: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (Stevemunene) >>! In T340648#9034129, @Manuel wrote: > What Andrew said! Please also add our engineering manager @karapayneWMDE to the group. Noticed @karapayneWMDE is not part of the parent `analytics-...
[14:09:30] Data-Platform-SRE, serviceops, Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (pfischer) @JMeybohm, @dcausse and I put together [[ https://docs.google.com/spreadsheets/d/1Fp44MdLxUVlxi03MBD...
[14:14:16] Data-Platform-SRE, serviceops, Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (Joe) A two quarters rollout would mean that we can't really wait for the removal of the old topics; this would...
[14:22:52] Data-Platform-SRE: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (Manuel) Can you add @karapayneWMDE to the parent group? If not, what would be required to do so? (see {T284308} for reference)
[14:52:27] (MediawikiPageContentChangeEnrichAvailability) resolved: ...
[14:52:27] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[14:53:27] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[14:53:27] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[16:04:43] (SystemdUnitFailed) firing: jupyter-dsaez-singleuser-conda-analytics.service Failed on stat1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:58:02] milimetric: take a look at this that can be self hosted and integrates with kafka: https://posthog.com/product-analytics it is the new matomo
[17:32:45] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1152.eqiad.wmnet with OS bullseye
[17:37:21] Data-Engineering, Data-Platform-SRE, DC-Ops, ops-eqiad: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (RobH)
[17:37:29] Data-Engineering, Data-Platform-SRE, DC-Ops, ops-eqiad: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (RobH)
[18:30:55] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1152.eqiad.wmnet with OS bullseye
[18:53:27] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[18:53:27] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[19:04:33] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1152.eqiad.wmnet with OS bullseye completed: - an-worker1152 (*...
[19:06:23] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (Jhancock.wm)
[19:07:19] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1151.eqiad.wmnet with OS bullseye
[19:27:01] Data-Platform-SRE, Discovery-Search: Write new partman recipe for cloudelastic (jbod) - https://phabricator.wikimedia.org/T342463 (bking)
[19:39:46] Data-Engineering, Data-Engineering-Wikistats, Inuka-Team, Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (Flomeier85) Hi @Htriedman I have a follow-up question to the discussion you and @Strainu had a few months...
[19:41:21] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1151.eqiad.wmnet with OS bullseye completed: - an-worker1151 (*...
[19:42:07] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (Jhancock.wm)
[19:43:34] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1150.eqiad.wmnet with OS bullseye
[20:04:43] (SystemdUnitFailed) firing: jupyter-dsaez-singleuser-conda-analytics.service Failed on stat1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:19:21] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1150.eqiad.wmnet with OS bullseye completed: - an-worker1150 (*...
[20:20:06] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (Jhancock.wm)
[20:21:38] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1149.eqiad.wmnet with OS bullseye
[20:40:57] Data-Platform-SRE, Discovery-Search: Write new partman recipe for cloudelastic (jbod) - https://phabricator.wikimedia.org/T342463 (bking) Haven't looked closely, but I'm guessing the following recipes from `modules/install_server/files/autoinstall/netboot.cfg` could work (or would be the easiest to adapt...
[20:41:58] Data-Platform-SRE, Discovery-Search: Write new partman recipe for cloudelastic (jbod) - https://phabricator.wikimedia.org/T342463 (bking)
[21:00:09] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1149.eqiad.wmnet with OS bullseye completed: - an-worker1149 (*...
[21:01:14] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (Jhancock.wm)
[21:02:32] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (Jhancock.wm) Open→Resolved @BTullis finally finished. thanks for your patience.
[21:58:48] Data-Engineering, Advanced-Search, All-and-every-Wikisource, ArticlePlaceholder, and 62 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (Jdlrobson)
[22:07:11] Data-Engineering, Data-Engineering-Wikistats, Inuka-Team, Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (Htriedman) @Flomeier85 if you have any questions at all feel free to post them here or reach out to me via...
[22:53:27] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[22:53:27] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability