[02:04:44] (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:19:44] (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:13:28] (MediawikiPageContentChangeEnrichAvailability) firing: ... [03:13:29] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability [03:19:44] (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:34:44] (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:49:45] (SystemdUnitFailed) firing: (2) refinery-sqoop-wikifunctions-production.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[07:10:14] 10Data-Platform-SRE, 10CAS-SSO, 10Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (10Stevemunene) on the datahub charts we need to replace the jaas configmap with an oidc setup For the env variables we can avail them via ` {{- if .Values.auth.... [07:13:44] (MediawikiPageContentChangeEnrichAvailability) firing: ... [07:13:44] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability [07:16:43] 10Data-Engineering, 10DBA: dbstore1007:s2 crashed - https://phabricator.wikimedia.org/T343109 (10Marostegui) [08:48:13] 10Data-Engineering, 10DBA: dbstore1007:s2 crashed - https://phabricator.wikimedia.org/T343109 (10Ladsgroup) and since it's multiinstance, we can't use the clone cookbook [08:54:24] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: dbstore1007:s2 crashed - https://phabricator.wikimedia.org/T343109 (10BTullis) p:05Triage→03High [08:57:25] (03CR) 10Nmaphophe: GDI Equity Landscape Tables (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/941911 (owner: 10Nmaphophe) [09:02:22] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: dbstore1007:s2 crashed - https://phabricator.wikimedia.org/T343109 (10BTullis) Thanks ever so much for picking up this crash and creating this ticket. Just so that I understand the context, you're suggesting re-cloning because you feel that there's likely to be... 
[09:10:23] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: dbstore1007:s2 crashed - https://phabricator.wikimedia.org/T343109 (10Marostegui) It is probably unlikely there's data inconsistency after that just one crash, however, it is likely that that kind of crash is going to happen again at some point (that is my expe... [09:22:51] 10Data-Platform-SRE, 10SRE, 10SRE-Access-Requests: Requesting access to analytics-wmde-users (no kerberos, with ssh) for karapayneWMDE - https://phabricator.wikimedia.org/T342546 (10BTullis) p:05Triage→03Medium [10:06:54] 10Data-Platform-SRE, 10DBA, 10Data-Services: Prepare and check storage layer for btmwiktionary - https://phabricator.wikimedia.org/T342670 (10BTullis) 05Open→03Resolved I've verified that this is now available from toolforge like this: ` btullis@tools-sgebastion-10:~$ sql btmwiktionary Reading table info... [10:49:59] (SystemdUnitFailed) firing: refinery-sqoop-wikifunctions-production.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:16:47] 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10GitLab (Project Migration), 10Release-Engineering-Team (Priority Backlog 📥): Migrate analytics/datahub pipeline to GitLab - https://phabricator.wikimedia.org/T341194 (10BTullis) Oh, I'm getting a crashloop from the mce-container: It looks l... [11:18:29] (MediawikiPageContentChangeEnrichAvailability) firing: ... 
[11:18:29] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability [11:20:18] 10Data-Platform-SRE, 10Abstract Wikipedia team, 10DBA, 10Data-Services, 10cloud-services-team: Prepare and check storage layer for Wikifunctions.org (new public content wiki) - https://phabricator.wikimedia.org/T289316 (10BTullis) This has now been done. I verified that the database is available from too... [11:38:51] 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10GitLab (Project Migration), 10Patch-For-Review, 10Release-Engineering-Team (Priority Backlog 📥): Migrate analytics/datahub pipeline to GitLab - https://phabricator.wikimedia.org/T341194 (10CodeReviewBot) btullis opened https://gitlab.wiki... [11:55:34] 10Data-Platform-SRE, 10Abstract Wikipedia team, 10DBA, 10Data-Services, 10cloud-services-team: Prepare and check storage layer for Wikifunctions.org (new public content wiki) - https://phabricator.wikimedia.org/T289316 (10Marostegui) 05Open→03Resolved Thanks [12:25:40] !log upgrading airflow on an-launcher1002 to 2.6.3 [12:25:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:39:27] 10Data-Platform-SRE, 10Data Pipelines, 10Patch-For-Review: Upgrade Airflow to version 2.6.3 - https://phabricator.wikimedia.org/T336286 (10BTullis) I have upgraded the analytics instance. All migrations executed. ` btullis@an-launcher1002:~$ sudo su - analytics airflow-analytics db upgrade su: warning: cann... 
[12:49:48] 10Data-Platform-SRE: Ensure WCQS stack works on Bullseye or later - https://phabricator.wikimedia.org/T342701 (10Gehel) [12:52:20] 10Data-Platform-SRE: Ensure WCQS stack works on Bullseye or later - https://phabricator.wikimedia.org/T342701 (10Gehel) [12:52:23] 10Data-Platform-SRE, 10Discovery-Search (Current work): Ensure WDQS stack works on Bullseye - https://phabricator.wikimedia.org/T331300 (10Gehel) [12:52:43] 10Data-Platform-SRE: Ensure WCQS stack works on Bullseye or later - https://phabricator.wikimedia.org/T342701 (10Gehel) WCQS and WDQS are running the same stack. Proving that one works is sufficient. [12:55:42] 10Data-Platform-SRE, 10Epic: [Epic] Migrate all Search Platform servers to Debian Bullseye - https://phabricator.wikimedia.org/T323921 (10Gehel) [13:02:17] 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (10Gehel) a:03bking [13:10:42] 10Data-Platform-SRE, 10Data Pipelines, 10Patch-For-Review: Upgrade Airflow to version 2.6.3 - https://phabricator.wikimedia.org/T336286 (10BTullis) Upgraded the search instance: ` btullis@an-airflow1005:~$ sudo run-puppet-agent Info: Using configured environment 'production' Info: Retrieving pluginfacts Info... [13:18:03] 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10GitLab (Project Migration), 10Patch-For-Review, 10Release-Engineering-Team (Priority Backlog 📥): Migrate analytics/datahub pipeline to GitLab - https://phabricator.wikimedia.org/T341194 (10CodeReviewBot) btullis merged https://gitlab.wiki... [13:37:12] 10Data-Platform-SRE, 10Data Pipelines, 10Patch-For-Review: Upgrade Airflow to version 2.6.3 - https://phabricator.wikimedia.org/T336286 (10BTullis) Upgraded the research instance: ` btullis@an-airflow1002:~$ sudo run-puppet-agent Info: Using configured environment 'production' Info: Retrieving pluginfacts In... 
[14:08:20] 10Data-Platform-SRE, 10Data Pipelines, 10Patch-For-Review: Upgrade Airflow to version 2.6.3 - https://phabricator.wikimedia.org/T336286 (10BTullis) I have upgraded the platform_eng instance. ` btullis@an-airflow1004:~$ sudo run-puppet-agent Info: Using configured environment 'production' Info: Retrieving plu... [14:54:44] (SystemdUnitFailed) firing: refinery-sqoop-wikifunctions-production.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:18:29] (MediawikiPageContentChangeEnrichAvailability) firing: ... [15:18:29] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability [15:29:23] hi folks! [15:29:35] I filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/941840 a while ago, changes applied to kafka main and logging [15:29:59] I'd say that we should apply to jumbo too, for future resilience, even if from the metrics we don't really need to [15:30:10] but I'll let you decide, we can keep things as they are [15:59:51] 10Data-Platform-SRE, 10Data Pipelines, 10Patch-For-Review: Upgrade Airflow to version 2.6.3 - https://phabricator.wikimedia.org/T336286 (10BTullis) I have upgraded the analyics-product instance of Airflow, which is the last of the production instances. ` btullis@an-airflow1006:~$ sudo run-puppet-agent Info:... 
[16:00:04] 10Data-Platform-SRE, 10Data Pipelines, 10Patch-For-Review: Upgrade Airflow to version 2.6.3 - https://phabricator.wikimedia.org/T336286 (10BTullis) [16:00:31] 10Data-Platform-SRE, 10Data Pipelines, 10Patch-For-Review: Upgrade Airflow to version 2.6.3 - https://phabricator.wikimedia.org/T336286 (10CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/450 Update airflow to version 2.6.3 [16:02:20] 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Data Pipelines: Decomission an-airflow1003 (legacy platform_eng instance) - https://phabricator.wikimedia.org/T315633 (10BTullis) I'm going to start work on this task to decom an-airflow1003. The main reason for wanting to get it done is that... [16:12:44] 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10GitLab (Project Migration), 10Release-Engineering-Team (Priority Backlog 📥): Migrate analytics/datahub pipeline to GitLab - https://phabricator.wikimedia.org/T341194 (10BTullis) This is now done. We have fully deployed the new images genera... 
[16:17:37] 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Data Pipelines, 10Patch-For-Review: Decomission an-airflow1003 (legacy platform_eng instance) - https://phabricator.wikimedia.org/T315633 (10BTullis) p:05Triage→03Medium [16:22:11] (03PS5) 10Sharvaniharan: Android: New schema for image recommendations feature [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/940266 [16:26:49] 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Data Pipelines, 10Patch-For-Review: Decomission an-airflow1003 (legacy platform_eng instance) - https://phabricator.wikimedia.org/T315633 (10BTullis) [16:30:33] 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Data Pipelines: Decomission an-airflow1003 (legacy platform_eng instance) - https://phabricator.wikimedia.org/T315633 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by btullis@cumin1001 for hosts: `an-airflow1003.eqiad.wmnet... [16:32:27] 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Data Pipelines: Decomission an-airflow1003 (legacy platform_eng instance) - https://phabricator.wikimedia.org/T315633 (10BTullis) [16:32:51] 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Data Pipelines: Decomission an-airflow1003 (legacy platform_eng instance) - https://phabricator.wikimedia.org/T315633 (10BTullis) 05Open→03Resolved [16:40:40] 10Data-Platform-SRE, 10decommission-hardware, 10Patch-For-Review: decommission db1108.eqiad.wmnet - https://phabricator.wikimedia.org/T336254 (10BTullis) [16:45:59] 10Data-Platform-SRE, 10decommission-hardware, 10Patch-For-Review: decommission db1108.eqiad.wmnet - https://phabricator.wikimedia.org/T336254 (10BTullis) [16:47:20] 10Data-Platform-SRE, 10decommission-hardware, 10Patch-For-Review: decommission db1108.eqiad.wmnet - https://phabricator.wikimedia.org/T336254 (10BTullis) [17:03:38] 10Data-Platform-SRE, 10Discovery-Search (Current work): decom an-airflow1001 - https://phabricator.wikimedia.org/T333697 (10BTullis) 05Open→03Resolved a:03BTullis I think that this task is complete.
[17:03:41] 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work): [Tracking] Migrate Search Airflow jobs to Airflow 2 and use shared supporting code from the data engineering Airflow - https://phabricator.wikimedia.org/T318414 (10BTullis) [17:04:26] 10Data-Platform-SRE, 10decommission-hardware, 10Patch-For-Review: decommission db1108.eqiad.wmnet - https://phabricator.wikimedia.org/T336254 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by btullis@cumin1001 for hosts: `db1108.eqiad.wmnet` - db1108.eqiad.wmnet (**WARN**) - Downtimed hos... [17:07:54] 10Data-Platform-SRE, 10decommission-hardware, 10ops-eqiad, 10Patch-For-Review: decommission db1108.eqiad.wmnet - https://phabricator.wikimedia.org/T336254 (10BTullis) a:05BTullis→03Jclark-ctr [17:08:07] 10Data-Platform-SRE, 10decommission-hardware, 10ops-eqiad, 10Patch-For-Review: decommission db1108.eqiad.wmnet - https://phabricator.wikimedia.org/T336254 (10BTullis) [18:39:17] 10Data-Engineering, 10Data Products (Sprint 0): [opsweek] Don't pollute skein logs. Part II. - https://phabricator.wikimedia.org/T342926 (10xcollazo)
[18:39:23] 10Data-Platform-SRE: [opsweek] Bump Yarn logs retention period to support debugging long running jobs - https://phabricator.wikimedia.org/T342923 (10xcollazo) [18:54:44] (SystemdUnitFailed) firing: refinery-sqoop-wikifunctions-production.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:03:56] !log Deployed https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/471 for analytics Airflow instance [19:03:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:18:29] (MediawikiPageContentChangeEnrichAvailability) firing: ... [19:18:29] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability [20:40:22] 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Data Pipelines: Decomission an-airflow1003 (legacy platform_eng instance) - https://phabricator.wikimedia.org/T315633 (10xcollazo) a:05xcollazo→03BTullis [22:54:45] (SystemdUnitFailed) firing: refinery-sqoop-wikifunctions-production.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:18:29] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[23:18:29] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability [23:35:22] 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10GitLab (Project Migration), 10Release-Engineering-Team (Priority Backlog 📥): Migrate analytics/datahub pipeline to GitLab - https://phabricator.wikimedia.org/T341194 (10thcipriani) >>! In T341194#9052367, @thcipriani wrote: >>>! In T341194#...