[00:46:56] Data-Engineering: nshahquinn-wmf cannot edit in DataHub - https://phabricator.wikimedia.org/T344212 (nshahquinn-wmf)
[00:49:58] Data-Engineering: nshahquinn-wmf cannot edit in DataHub - https://phabricator.wikimedia.org/T344212 (nshahquinn-wmf)
[03:43:53] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[03:43:53] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[04:30:19] Data-Engineering: Request DataHub edit access for David Martin (dmartin) - https://phabricator.wikimedia.org/T344217 (DMartin-WMF)
[04:46:52] Data-Engineering, Data-Engineering-Wikistats, Data Pipelines, Data Products, and 3 others: Merge Ks-Arab and Ks-Deva to ks - https://phabricator.wikimedia.org/T314476 (srishakatux) a:srishakatux
[07:48:39] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[07:48:39] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[09:34:16] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1096.eqiad.wmnet with OS bullseye
[10:00:32] Data-Platform-SRE, SRE, SRE-Access-Requests: Requesting access to analytics-wmde-users (no kerberos, with ssh) for karapayneWMDE - https://phabricator.wikimedia.org/T342546 (Stevemunene) The changes have been merged and @karapayneWMDE now has shell access and is a member of `analytics-wmde-users`
[10:56:05] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-worker1096.eqiad.wmnet with OS bullseye executed with errors: - an-worker1096 (**FAIL**) - Downtimed on Icinga...
[11:07:42] (SystemdUnitFailed) firing: clean-confd-rundir.service Failed on an-db1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:12:03] Analytics, Data-Engineering, MediaWiki-extensions-EventLogging, Research-Freezer: 20K events by a single user in the span of 20 mins - https://phabricator.wikimedia.org/T202539 (phuedx) Open→Declined Being **bold**. The CitationUsage instrument was removed in September 2020 and disabled r...
[11:16:34] Data-Engineering, MediaWiki-extensions-EventLogging: Integration tests should exercise code with and without EventStreamConfig loaded - https://phabricator.wikimedia.org/T344239 (phuedx)
[11:24:54] Data-Platform-SRE, Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (Stevemunene)
[11:32:42] (SystemdUnitFailed) resolved: clean-confd-rundir.service Failed on an-db1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:48:39] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[11:48:39] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[12:16:20] Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (Stevemunene) Actively working on this, thus moving it back in progress as we plan on implementing the solutions defined on https://phabricator.wikimedia.org/T3432...
[14:01:44] Data-Engineering, Growth-Team, MediaWiki-extensions-EventLogging, Notifications, and 2 others: Decommission the EchoMail and EchoInteraction instruments - https://phabricator.wikimedia.org/T344167 (phuedx)
[14:07:15] Data-Platform-SRE: WDQS/WCQS: Create a script or process that verifies "deployment worthiness" - https://phabricator.wikimedia.org/T343712 (bking) Changes to the process suggested by @Gehel have simplified the documentation enough that I don't believe we'll need a script at this point.
Closing for now...
[14:07:40] Data-Platform-SRE: WDQS/WCQS: Create a script or process that verifies "deployment worthiness" - https://phabricator.wikimedia.org/T343712 (bking) Open→Declined p:Triage→Medium
[14:17:55] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1097.eqiad.wmnet with OS bullseye
[14:20:26] Data-Platform-SRE, serviceops, Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (bking) > Unknowns: > * We have talked about running a copy of the main flink application per datacenter, with...
[14:26:29] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1015.eqiad.wmnet with OS bullseye
[14:26:54] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1014.eqiad.wmnet with OS bullseye
[14:32:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:33:09] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:45:55] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:47:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:07:51] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-worker1097.eqiad.wmnet with OS bullseye completed: - an-worker1097 (**WARN**) - Downtimed on Icinga/Alertmanag...
[15:27:45] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1014.eqiad.wmnet with OS bullseye completed: - wdqs1014 (**WARN**) - Downtimed on Icinga/Alertman...
[15:36:42] Data-Platform-SRE, SRE, SRE-Access-Requests: Requesting access to analytics-wmde-users (no kerberos, with ssh) for karapayneWMDE - https://phabricator.wikimedia.org/T342546 (Gehel) Open→Resolved
[15:36:44] Data-Platform-SRE, Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (Gehel)
[15:37:29] Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (Gehel) Open→Resolved
[15:37:32] Data-Platform-SRE, Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (Gehel)
[15:37:42] Data-Platform-SRE: Decommission wdqs10[03-05] - https://phabricator.wikimedia.org/T344198 (RKemper) a:RKemper
[15:41:44] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk2001.codfw.wmnet with OS bo...
[15:44:22] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1015.eqiad.wmnet with OS bullseye completed: - wdqs1015 (**WARN**) - Downtimed on Icinga/Alertman...
[15:48:39] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[15:48:39] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[15:55:26] Data-Engineering, Data-Engineering-Jupyter, Data-Platform-SRE: Autocomplete is very slow (unusable) in Newpyter - https://phabricator.wikimedia.org/T290008 (BTullis) @diego - Would it be possible for you to test this again in the newer version of our JupyterHub environment please? I'm keen to see whe...
[15:56:37] Data-Engineering, Data-Platform-SRE, DBA: Re-clone dbstore1007:s2 following a crash - https://phabricator.wikimedia.org/T343109 (BTullis)
[16:04:14] Data-Engineering, Data-Platform-SRE, serviceops-radar, Patch-For-Review: Get datahub-staging.wikimedia.org working with the staging deployment of datahub - https://phabricator.wikimedia.org/T343236 (BTullis) Open→Declined a:Stevemunene→None >>! In T343236#9086687, @JMeybohm wrote:...
[16:04:18] Data-Engineering, Data-Platform-SRE, Epic: Data Catalog MVP - https://phabricator.wikimedia.org/T299910 (BTullis)
[16:05:36] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1098.eqiad.wmnet with OS bullseye
[16:10:46] Data-Platform-SRE, Observability-Alerting, observability: Create VictorOps config for new Data Platform SRE team - https://phabricator.wikimedia.org/T344202 (herron) Hey @bking, thanks for the task! Could you please point me towards the current team roster in order to get the ball rolling for team/a...
[16:26:57] Data-Platform-SRE, Observability-Alerting, observability: Create VictorOps config for new Data Platform SRE team - https://phabricator.wikimedia.org/T344202 (BTullis) I'm wondering whether it would be a good idea to re-use the existing [[https://portal.victorops.com/dash/wikimedia#/team/team-4bCl5lW3...
[16:34:22] Data-Engineering, Data-Platform-SRE: nshahquinn-wmf cannot edit in DataHub - https://phabricator.wikimedia.org/T344212 (BTullis) a:BTullis
[16:49:34] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-worker1098.eqiad.wmnet with OS bullseye completed: - an-worker1098 (**PASS**) - Downtimed on Icinga/Alertmanag...
[16:55:45] (CR) Shay Nowick: [C: +2] Updated documentation: Image recs schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/948556 (owner: Sharvaniharan)
[16:56:17] (Merged) jenkins-bot: Updated documentation: Image recs schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/948556 (owner: Sharvaniharan)
[16:57:47] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk2001.codfw.wmnet with OS bookwo...
[16:59:54] Data-Platform-SRE: Ensure Data Platform SREs have a contact group in puppet/alerting - https://phabricator.wikimedia.org/T342578 (herron)
[17:00:06] Data-Platform-SRE, Observability-Alerting, observability: Create VictorOps config for new Data Platform SRE team - https://phabricator.wikimedia.org/T344202 (herron) Open→Stalled Thanks for the info. With this in mind I'm going to stall this victorops setup task while the details of the desi...
[17:40:16] Data-Engineering, Data-Engineering-Jupyter, Data-Platform-SRE: Autocomplete is very slow (unusable) in Newpyter - https://phabricator.wikimedia.org/T290008 (diego) @BTullis , yes, this has been solved in the current environment, thanks!
[17:40:30] Data-Engineering, Data-Engineering-Jupyter, Data-Platform-SRE: Autocomplete is very slow (unusable) in Newpyter - https://phabricator.wikimedia.org/T290008 (diego) Open→Resolved
[18:12:30] Data-Platform-SRE: Upgrade Spark to a version that has long term Iceberg support - https://phabricator.wikimedia.org/T338057 (xcollazo)
[18:13:05] Data-Platform-SRE: Upgrade Spark to a version that has long term Iceberg support - https://phabricator.wikimedia.org/T338057 (xcollazo) (Rerouted to #data-platform-sre as per https://docs.google.com/presentation/d/1f7iBHd3QmGVyejOmPOjIm3WtgCZ3i7qcfm3bI5W5ALw/edit#slide=id.g225ebfe300a_0_15 )
[18:16:30] Data-Platform-SRE: Bump Spark to 3.3.x or 3.4.x line. - https://phabricator.wikimedia.org/T344266 (xcollazo)
[18:45:44] Data-Platform-SRE, Discovery-Search, Wikidata, Wikidata-Query-Service: Rename usages of whitelist to allowlist in query service rdf repo - https://phabricator.wikimedia.org/T344284 (RKemper)
[18:46:14] Data-Platform-SRE, Discovery-Search, Wikidata, Wikidata-Query-Service: Rename usages of whitelist to allowlist in query service rdf repo - https://phabricator.wikimedia.org/T344284 (RKemper)
[18:48:20] Data-Platform-SRE: Upgrade Spark to a version that has long term Iceberg support - https://phabricator.wikimedia.org/T338057 (xcollazo)
[19:48:39] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[19:48:39] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[20:10:41] Data-Platform-SRE: Bump Spark to 3.3.x or 3.4.x line. - https://phabricator.wikimedia.org/T344266 (xcollazo)
[20:10:50] Data-Platform-SRE: Upgrade Spark to a version that has long term Iceberg support - https://phabricator.wikimedia.org/T338057 (xcollazo)
[20:13:10] Data-Platform-SRE: Upgrade Spark to a version that has long term Iceberg support - https://phabricator.wikimedia.org/T338057 (xcollazo) Copying rationale to move forward with this work from T344266: > While iterating on an Apache Iceberg MERGE INTO on T340861, we hit T342587, in which the MERGE job genera...
[20:14:00] Data-Platform-SRE: Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0 - https://phabricator.wikimedia.org/T338057 (xcollazo)
[20:40:20] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Patch-For-Review: Move whitelist.txt from WDQS deploy repo into puppet and rename it to "allow list" - https://phabricator.wikimedia.org/T343856 (bking)
[20:49:23] Data-Platform-SRE, Infrastructure-Foundations: Puppet: consider skipping SPDX enforcement on text files - https://phabricator.wikimedia.org/T344291 (bking)
[21:32:24] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1012.eqiad.wmnet with OS bullseye
[21:33:00] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1013.eqiad.wmnet with OS bullseye
[21:42:55] Data-Platform-SRE: Examine/refactor WDQS startup scripts - https://phabricator.wikimedia.org/T342361 (bking)
[21:51:19] Data-Platform-SRE: Examine/refactor WDQS startup scripts - https://phabricator.wikimedia.org/T342361 (bking)
[21:52:18] Data-Platform-SRE: Examine/refactor WDQS startup scripts - https://phabricator.wikimedia.org/T342361 (bking) This issue came up again when working on T343856 . Broadening the scope of this ticket to include all WDQS startup scripts, not just categories.
[21:55:48] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (bking) ===== NODE GROUP ===== (19) wdqs[2007-2022].codfw.wmnet,wdqs[1014-1016].eqiad.wmnet ----- OUTPUT of 'cat /etc/debian_version' ----- 11.7 ===== NODE GROUP ===== (8) wdqs[1006-1013].eqiad.wmn...
[22:22:24] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1013.eqiad.wmnet with OS bullseye completed: - wdqs1013 (**WARN**) - Downtimed on Icinga/Alertman...
[22:27:59] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1012.eqiad.wmnet with OS bullseye completed: - wdqs1012 (**WARN**) - Downtimed on Icinga/Alertman...
[23:48:39] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[23:48:39] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[23:52:03] (PS1) Clare Ming: DNM POC for extracting Metrics Platform fragments [schemas/event/secondary] - https://gerrit.wikimedia.org/r/949141 (https://phabricator.wikimedia.org/T343557)
[23:52:37] (CR) CI reject: [V: -1] DNM POC for extracting Metrics Platform fragments [schemas/event/secondary] - https://gerrit.wikimedia.org/r/949141 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[23:57:17] (PS2) Clare Ming: DNM POC for extracting Metrics Platform fragments [schemas/event/secondary] - https://gerrit.wikimedia.org/r/949141 (https://phabricator.wikimedia.org/T343557)