[00:04:23] (SystemdUnitFailed) firing: (14) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:12:10] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (10Hannah_Bast) Yes, https://qlever.cs.uni-freiburg.de/api/dblp is the URL for API calls, whereas https://qlever.cs.uni-fre... [01:19:23] (SystemdUnitFailed) firing: (14) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:31:47] 10Data-Engineering: [Data Quality] decrease line width and point size in Airflow metrics dashboard - https://phabricator.wikimedia.org/T356359 (10Ahoelzl) [02:34:36] 10Data-Engineering (Sprint 8), 10Patch-For-Review: NEW BUG REPORT 12 new wikis missing from the mediawiki_history dataset - https://phabricator.wikimedia.org/T349743 (10Ahoelzl) a:03JAllemandou [02:39:32] 10Data-Engineering (Sprint 8): [Dataset Config Store] - Define config API for navigationtiming and implement local development instance - https://phabricator.wikimedia.org/T355542 (10Ahoelzl) [02:43:05] 10Data-Engineering (Sprint 8): [Refine Refactoring] Orchestrate Airflow execution of navigationtiming from config store - https://phabricator.wikimedia.org/T356360 (10Ahoelzl) [02:47:59] 10Data-Engineering (Sprint 8): [Refine Refactoring] [Spike] Define a concept and provide a PoC for dynamic DAG execution in Airflow - https://phabricator.wikimedia.org/T356362 (10Ahoelzl) [02:49:20] 10Data-Engineering, 10Data Pipelines: [Refine refactoring] Refactor and migrate navigationtiming to Airflow - 
https://phabricator.wikimedia.org/T356192 (10Ahoelzl) [02:50:06] 10Data-Engineering (Sprint 8): [Refine Refactoring] Orchestrate Airflow execution of navigationtiming from config store - https://phabricator.wikimedia.org/T356360 (10Ahoelzl) [02:50:56] 10Data-Engineering (Sprint 8): [Refine Refactoring] Refactor refinery code for compatibility with Airflow integration - https://phabricator.wikimedia.org/T356363 (10Ahoelzl) [02:55:08] 10Data-Engineering (Sprint 8): [Dataset Config Store] - Define config API for navigationtiming and implement local development instance - https://phabricator.wikimedia.org/T355542 (10Ahoelzl) a:03tchin [02:59:11] 10Data-Engineering (Sprint 8): [Maintenance] Migrate Gitlab CI to blubber - https://phabricator.wikimedia.org/T356364 (10Ahoelzl) [03:00:22] 10Data-Engineering (Sprint 8): [Refine Refactoring] Orchestrate Airflow execution of navigationtiming from config store - https://phabricator.wikimedia.org/T356360 (10Ahoelzl) [03:00:33] 10Data-Engineering (Sprint 8): [Refine Refactoring] [Spike] Define a concept and provide a PoC for dynamic DAG execution in Airflow - https://phabricator.wikimedia.org/T356362 (10Ahoelzl) [03:00:46] 10Data-Engineering (Sprint 8): [Refine Refactoring] Refactor refinery code for compatibility with Airflow integration - https://phabricator.wikimedia.org/T356363 (10Ahoelzl) [05:19:26] (SystemdUnitFailed) firing: (13) refinery-sqoop-mediawiki-production-daily.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:58:06] 10Analytics, 10Data-Engineering, 10Data-Engineering-Wikistats, 10Data Products: No page views by country data for Turkey - https://phabricator.wikimedia.org/T355404 (10Chidgk1) What privacy reasons please? 
[08:06:27] 10Analytics, 10Data-Engineering, 10Data-Engineering-Wikistats, 10Data Products: No page views by country data for Turkey - https://phabricator.wikimedia.org/T355404 (10Chidgk1) Perhaps stats for all countries below a certain level of democracy are hidden? If so how is this decided? [08:17:54] 10Analytics, 10Data-Engineering, 10Data-Engineering-Wikistats, 10Data Products: No page views by country data for Turkey - https://phabricator.wikimedia.org/T355404 (10Chidgk1) Ah ok I found https://foundation.wikimedia.org/wiki/Legal:Country_and_Territory_Protection_List Could this be linked from the st... [08:21:19] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Allow federated queries with the MiMoTextBase SPARQL endpoint - https://phabricator.wikimedia.org/T351488 (10HinMar) @RKemper : Thank you for your message. The project has ended, but we still kindly ask you to whitelist this endpoint. We at the Trier... [08:23:47] 10Analytics, 10Data-Engineering, 10Data-Engineering-Wikistats, 10Data Products: No page views by country data for Turkey - https://phabricator.wikimedia.org/T355404 (10Chidgk1) 05Declined→03Open https://foundation.wikimedia.org/wiki/Legal:Country_and_Territory_Protection_List says “ The Wikimedia Found... 
[09:11:34] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 5.997% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [09:20:49] (SystemdUnitFailed) firing: (13) refinery-sqoop-mediawiki-production-daily.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:25:13] btullis, brouberol: quick update on the live dashboard for today's superset upgrade: [09:25:34] 1) https://gerrit.wikimedia.org/r/c/operations/puppet/+/994811 is ready to be merged after the upgrade for requestctl-generator [09:26:09] 2) unfortunately I have a meeting starting at 10 UTC, right when the upgrade should start, so I'll be able to check things only a bit later [09:26:44] 3) from my tests on superset-next everything related to the live dashboard should work fine, with some nice improvements of the new version [09:27:49] * I will update the Help tab later to adapt some description on where some things are in the context menus or the fact that clicking on an item in a table does filter for it automatically (cross filters) and check that they work with requestctl-generator (from a quick look at the API response it should) [09:30:38] 4) I'll adapt a virtual dataset query I use later because it needs tweaking, but is not used by others for now, so not an issue [09:30:46] * volans EOF [09:31:41] volans: Great! Thanks for all the feedback and prep. [09:33:08] thank you for the upgrade!
[09:46:39] 10Data-Engineering (Sprint 7), 10Patch-For-Review: [Iceberg Migration] Migrate interlanguage tables to Iceberg - https://phabricator.wikimedia.org/T352671 (10CodeReviewBot) joal merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/590 Add iceberg version of interlanguage_da... [09:49:33] !log deploying airflow for interlanguage_navigation in Iceberg [09:49:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:02:00] will you be coordinating here the upgrade? [10:03:31] Yes, plus on https://phabricator.wikimedia.org/T335356 [10:06:10] ack thx [10:07:16] (03CR) 10Btullis: [V: 03+2 C: 03+2] Improve the display of nested columns from presto [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/994213 (https://phabricator.wikimedia.org/T340144) (owner: 10Btullis) [10:08:19] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [10:08:33] !log deploying Superset 3.1.0 to an-tool1010 with https://gerrit.wikimedia.org/r/c/analytics/superset/deploy/+/994213 [10:08:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:19:51] joal: I forgot to deploy https://gerrit.wikimedia.org/r/c/analytics/refinery/+/992944 yesterday [10:20:04] LMK when you're done with your deployment [10:21:01] Hi phuedx - I was starting to tackle this one as well :) [10:21:16] Let's deploy that now, so that we fix sqoop :) [10:21:26] Do I let you do, or do prefer I do it? [10:21:48] And actually, I don't think it was your fault, I read in the etherpad that the patch was deployed before, while it was not [10:33:18] ping phuedx ? 
[10:35:20] joal: I can do it now [10:35:42] (03CR) 10Phuedx: [V: 03+2 C: 03+2] "🚂" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/992944 (owner: 10Aqu) [10:35:48] The Superset upgrade to version 3.1.0 is complete. Everything looks ok so far, so you can go ahead and use it. [10:36:41] phuedx: ok :) Sorry, I didn't mean to put pressure, but rather to decide whether you or me :) [10:36:51] Understood ^^ [10:36:58] We have also taken the opportunity to deploy T340144 which causes nested columns in presto to be expanded and displayed. [10:36:58] T340144: Evaluate the PRESTO_EXPAND_DATA feature flag in superset - https://phabricator.wikimedia.org/T340144 [10:37:35] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Data Pipelines, 10Patch-For-Review: Evaluate the PRESTO_EXPAND_DATA feature flag in superset - https://phabricator.wikimedia.org/T340144 (10BTullis) This is now deployed to production. [10:38:04] btullis: that's awesome :) could you let people in analytics-world know about that? I'd like to use that opportunity to ask them if this is ok for us to remove hue :) [10:38:49] joal: Will do. [10:39:12] !log phuedx@deploy2002 Started deploy [analytics/refinery@0d8e976]: analytics/refinery: Remove trvwikisource from scoop list [10:39:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:47:02] (03CR) 10Joal: [C: 03+2] "LGTM! 
Thanks Guillaume :)" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/994674 (owner: 10Gehel) [10:50:09] !log phuedx@deploy2002 Finished deploy [analytics/refinery@0d8e976]: analytics/refinery: Remove trvwikisource from scoop list (duration: 10m 20s) [10:50:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:51:08] !log phuedx@deploy2002 Started deploy [analytics/refinery@0d8e976] (thin): Remove trvwikisource from scoop list [10:51:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:51:10] !log phuedx@deploy2002 Finished deploy [analytics/refinery@0d8e976] (thin): Remove trvwikisource from scoop list (duration: 00m 05s) [10:51:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:51:38] !log phuedx@deploy2002 Started deploy [analytics/refinery@0d8e976] (hadoop-test): Remove trvwikisource from scoop list [10:51:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:51:40] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10SRE-Access-Requests, 10Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10ABran-WMF) p:05High→03Medium [10:55:06] !log phuedx@deploy2002 Finished deploy [analytics/refinery@0d8e976] (hadoop-test): Remove trvwikisource from scoop list (duration: 03m 30s) [10:55:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:55:16] joal: ^ [10:55:48] (03CR) 10Joal: [C: 03+2] "LGTM! 
Thanks again Guillaume :)" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/994753 (owner: 10Gehel) [10:57:33] 10Data-Engineering, 10Data-Platform-SRE, 10Epic: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710 (10brouberol) [10:57:37] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Data Pipelines, 10Patch-For-Review: Evaluate the PRESTO_EXPAND_DATA feature flag in superset - https://phabricator.wikimedia.org/T340144 (10brouberol) 05Open→03Resolved [10:57:39] (03Merged) 10jenkins-bot: Simplifies CountryDatabaseReader. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/994674 (owner: 10Gehel) [10:57:43] Taking a short break [11:00:46] phuedx: Code has not been deployed to HDFS yet, right (sorry to bother during the break :) [11:06:17] (03Merged) 10jenkins-bot: Simplify GeocodeDatabaseReader. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/994753 (owner: 10Gehel) [11:16:44] 10Quarry, 10Data-Services, 10cloud-services-team (FY2023/2024-Q3-Q4): Create db user for Quarry with readonly access to public ToolsDB databases - https://phabricator.wikimedia.org/T348407 (10fnegri) [11:18:35] 10Quarry, 10superset.wmcloud.org, 10cloud-services-team (FY2023/2024-Q3-Q4): Replace Quarry with an installation of Superset - https://phabricator.wikimedia.org/T169452 (10fnegri) [11:18:51] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:20:38] joal: Doing that now [11:21:49] !log deploying the new spark-operator images based on JRE 8 for T354273 [11:21:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:24:12] 10Data-Engineering, 10EventStreams, 10MediaWiki-General, 10Privacy Engineering, and 3 others: Create Mediawiki "oversightprotect" action that suppresses usernames of all edits of a page 
- https://phabricator.wikimedia.org/T354577 (10MBH) Will this hide data or suppress it? I think, hiding may be better on... [11:26:04] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045 (10CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/merge_requests/41 Bump wmfdata to version 2.3.0 and add depe... [11:26:07] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (10CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/merge_requests/41 Bump wmfdata... [11:26:20] (03CR) 10Joal: "Some nits :)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/994697 (https://phabricator.wikimedia.org/T352672) (owner: 10Aqu) [11:29:07] !log Deployed refinery onto hdfs [11:29:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:52:11] Thanks phuedx - I'll restart the sqoop wiki [11:55:09] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (10BTullis) I have created [[https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/merge_requests/41|this merge reques... [12:03:42] btullis: I'm back from the meeting, sorry went longer than expected [12:04:08] I see https://gerrit.wikimedia.org/r/c/operations/puppet/+/994811 was not merged, is it ok to proceed? [12:04:43] volans: Yes, feel free. I left it for you to do at your convenience. [12:05:09] ok doing [12:05:15] Ack.
[12:08:26] !log Restart refinery-sqoop-whole-mediawiki.service after deploy [12:08:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:10:49] (SystemdUnitFailed) firing: (13) refinery-sqoop-mediawiki-production-daily.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:10:55] {done} [12:11:09] requestctl-generator works fine, and the dashboard too [12:12:39] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10DC-Ops, 10SRE, 10ops-codfw: Unable to reimage elastic2088 and elastic2094 to bullseye - https://phabricator.wikimedia.org/T355830 (10BTullis) a:03BTullis I'm going to have a crack at these reimages, if that's OK. Please let me know if I tread on anyone's t... [12:14:26] (SystemdUnitFailed) firing: (13) refinery-sqoop-mediawiki-production-daily.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:19:31] Ah crap - we have another project failing us :( [12:19:38] providing a patch [12:23:30] (03PS1) 10Joal: Remove abwikibooks from the sqoop list [analytics/refinery] - 10https://gerrit.wikimedia.org/r/995031 [12:23:35] phuedx: --^ [12:23:55] Similarly to yesterday, if not here, I'll push that myself in a few minutes [12:24:29] joal: Eating lunch. Can you push it yourself? [12:25:05] joal: This is a case that might need a little generalisation - incubator projects get pageviews but can't be sqooped. I dunno... [12:29:24] phuedx: It depends how they are defined in DB config for us...
[12:29:28] Pushing the thing [12:30:04] (03CR) 10Joal: [V: 03+2 C: 03+2] "Self merging for hotfix" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/995031 (owner: 10Joal) [12:30:54] !log hotfix HDFS sqoop list to prevent an entire redeploy [12:30:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:35:12] !log Rerun refinery-sqoop-whole-mediawiki after hotfix [12:35:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:36:07] There we go! pfew [12:36:42] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10DC-Ops, 10SRE, 10ops-codfw: Unable to reimage elastic2088 and elastic2094 to bullseye - https://phabricator.wikimedia.org/T355830 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host elastic2088.codfw.wmnet wi... [12:37:23] 10Data-Engineering (Sprint 8), 10Discovery-Search, 10Image-Suggestions: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2024-01-15 - https://phabricator.wikimedia.org/T356030 (10mfossati) [12:39:25] (SystemdUnitFailed) firing: (13) refinery-sqoop-mediawiki-production-daily.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:45:57] 10Data-Engineering (Sprint 8): [BUG] webrequest analyzer DQ jobs fails to store data - https://phabricator.wikimedia.org/T356401 (10gmodena) [12:46:10] 10Data-Engineering (Sprint 8): [BUG] webrequest analyzer DQ jobs fails to store data - https://phabricator.wikimedia.org/T356401 (10gmodena) a:03gmodena [12:50:30] 10Data-Engineering (Sprint 8): [BUG] webrequest analyzer DQ jobs fails to store data - https://phabricator.wikimedia.org/T356401 (10gmodena) Investigating. ... 
* On dev environments (stat1005, job submitted with user `gmodena`) missing databases are created. * On prod (an-launcher1002, job submitted with user `an... [12:56:51] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10VPS-project-Codesearch, 10Patch-For-Review: Add all Data Engineering gitlab repositories to codesearch - https://phabricator.wikimedia.org/T355069 (10Ladsgroup) 05Open→03Resolved [13:11:50] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 5.826% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [13:21:27] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10DC-Ops, 10SRE, 10ops-codfw: Unable to reimage elastic2088 and elastic2094 to bullseye - https://phabricator.wikimedia.org/T355830 (10BTullis) I tried a reimage of elastic2008 and it completely hung at the PXE prompt, for at least 20 minutes before I switched... [13:25:21] !log roll-restarting zookeeper on an-conf* for T356382 [13:25:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:34:10] !log roll-restarting zookeeper on druid-public for T356382 [13:34:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:39:24] 10Analytics-Radar, 10Data-Engineering, 10Data-Platform-SRE, 10SRE, and 2 others: Configuration Management for Kafka settings - https://phabricator.wikimedia.org/T276088 (10brouberol) I've taken a couple of hours to whip up this [[ https://gitlab.wikimedia.org/repos/sre/kafka-configurator | PoC ]], very uni... [13:39:58] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10DC-Ops, 10SRE, 10ops-codfw: Unable to reimage elastic2088 and elastic2094 to bullseye - https://phabricator.wikimedia.org/T355830 (10BTullis) I gave elastic2094 a cold boot, then started the reimage cookbook. It is reporting the following error on the consol...
[13:40:40] !log roll-restarting zookeeper on druid-analytics for T356382 [13:40:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:18:21] (03PS1) 10Gehel: Cleanup of ISPDatabaseReader. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/995042 [14:21:10] 10Data-Engineering, 10Release-Engineering-Team, 10collaboration-services, 10GitLab (CI & Job Runners), 10Patch-For-Review: Unblock Dockerfile syntax to build images with Gitlab trusted runner - https://phabricator.wikimedia.org/T351792 (10CodeReviewBot) aqu opened https://gitlab.wikimedia.org/repos/data-... [14:35:25] (03PS2) 10Gehel: Cleanup of ISPDatabaseReader. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/995042 [14:45:09] 10Data-Platform-SRE: Set up Spark SQL Server - https://phabricator.wikimedia.org/T324017 (10BTullis) I wonder whether we should revisit this topic again. I was thinking that it would be relatively easy project for us to set up a spark thrift-server on the dse-k8s cluster, given that we have successfully deploye... [14:46:06] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10DC-Ops, 10SRE, 10ops-codfw: Unable to reimage elastic2088 and elastic2094 to bullseye - https://phabricator.wikimedia.org/T355830 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host elastic2088.codfw.wmnet with O... [14:50:45] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10DC-Ops, 10SRE, 10ops-codfw: Unable to reimage elastic2088 and elastic2094 to bullseye - https://phabricator.wikimedia.org/T355830 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host elastic2088.codfw.wmnet wi...
[14:54:48] 10Data-Platform-SRE: Set up Spark SQL Server - https://phabricator.wikimedia.org/T324017 (10BTullis) [14:59:33] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10DC-Ops, 10SRE, 10ops-codfw: Unable to reimage elastic2088 and elastic2094 to bullseye - https://phabricator.wikimedia.org/T355830 (10BTullis) It looks like elastic2094 may have some kind of hardware problem. {F41739906,width=60%} I have tried both cold booti... [15:01:36] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10DC-Ops, 10SRE, 10ops-codfw: Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (10BTullis) [16:09:25] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (10Stevemunene) From the iDRAc interfce we can verify that the hosts have been set to RAID0 and that the virtual drives are visible as expected. {F41740139} {F417... [16:10:03] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10SRE-Access-Requests, 10Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10mpopov) Before closing this task I'd like to get a confirmation from Goran whether the level of access is... [16:10:58] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10DC-Ops, 10SRE, 10ops-codfw: Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host elastic2088.codfw.wmnet with OS... [16:13:44] 10Data-Platform-SRE, 10Discovery-Search, 10Patch-For-Review: Track and clean up object storage used by rdf-streaming-updater - https://phabricator.wikimedia.org/T348685 (10bking) Adding some notes on this topic based on conversation with @dcausse yesterday. Feel free to correct this if I missed anything. **... 
[16:19:06] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10DC-Ops, 10SRE, 10ops-codfw: Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host elastic2088.codfw.wmnet wit... [16:20:06] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10DC-Ops, 10SRE, 10ops-codfw: Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (10BTullis) I have restarted the reimage cookbook for elastic2088, I realise that I should have selected puppet 7 instead of pupp... [16:20:10] 10Data-Platform-SRE: Set up Spark SQL Server - https://phabricator.wikimedia.org/T324017 (10JAllemandou) While that could be useful, the spark-thrift server doesn't support user impersonation. The StackOverflow ticket I have read points to https://github.com/apache/kyuubi. We could investigate this. [16:38:48] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10DC-Ops, 10SRE, 10ops-codfw: Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host elastic2094.codfw.wmnet wit... [16:40:49] (SystemdUnitFailed) firing: (12) refinery-sqoop-mediawiki-production-daily.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:42:39] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10DC-Ops, 10SRE, 10ops-codfw: Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (10BTullis) I updated the system BIOS on elastic2094 from version 1.11.2 to version 1.12.1 but it didn't make any difference to t... 
[16:45:49] 10Analytics, 10AQS2.0, 10Tech-Docs-Team, 10Data Products (Epics Timeline), and 3 others: AQS 2.0 documentation - https://phabricator.wikimedia.org/T288664 (10apaskulin) [16:50:34] 10Data-Platform-SRE: Set up Spark SQL Server - https://phabricator.wikimedia.org/T324017 (10BTullis) > The StackOverflow ticket I have read points to https://github.com/apache/kyuubi. We could investigate this. Very interesting. [16:51:12] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10DC-Ops, 10SRE, 10ops-codfw: Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (10BTullis) a:05BTullis→03None [16:55:16] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10DC-Ops, 10SRE, 10ops-codfw: Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host elastic2088.codfw.wmnet with OS... [16:58:09] 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (10BTullis) @bking elastic2088 is now ready for the next step. elastic2094 is still showing an error and needs further investigation. [17:01:06] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10SRE-Access-Requests, 10Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10AndrewTavis_WMDE) Thank you for the continued attention here, @mpopov. Final investigations of this infra... 
[17:11:51] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 5.752% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [17:14:56] 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Upgrade Airflow instances to Bullseye - https://phabricator.wikimedia.org/T335261 (10BTullis) [17:28:39] 10Data-Engineering (Sprint 8): [Maintenance] Migrate one additional ReportUpdater job - https://phabricator.wikimedia.org/T356424 (10Ahoelzl) [17:29:07] 10Data-Engineering (Sprint 8): [Maintenance] Migrate one additional ReportUpdater job - https://phabricator.wikimedia.org/T356424 (10Ahoelzl) [18:41:18] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (10nshahquinn-wmf) >>! In T345482#9505663, @BTullis wrote: > However, I think we should take on that work seprately, if you don't mind... [18:51:24] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (10xcollazo) >I'd like us to be able to implement JupyterHub on Kubernetes instead of users running it on individual stat servers. Doi... 
[19:34:51] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Wikidata, 10Discovery-Search (Current work), 10Patch-For-Review: Enable cross federation between experimental WDQS endpoints - https://phabricator.wikimedia.org/T355888 (10Gehel) 05Open→03Resolved a:03Gehel [19:34:59] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Expose 3 new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650 (10Gehel) [19:41:54] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (10RKemper) >>! In T339347#9504488, @Hannah_Bast wrote: > Yes, https://qlever.cs.uni-freiburg.de/api/dblp is the URL for AP... [19:44:51] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: Allow federated queries with the MiMoTextBase SPARQL endpoint - https://phabricator.wikimedia.org/T351488 (10RKemper) >>! In T351488#9505091, @HinMar wrote: > @RKemper : Thank you for your message. The project has ended, but we... 
[20:01:13] 10Data-Engineering, 10Cassandra, 10Data Pipelines: Create puppet defined type for adding/updating/deleting secrets or other small files on HDFS - https://phabricator.wikimedia.org/T323692 (10Eevans) [20:01:15] 10Data-Engineering-Kanban, 10Data-Platform-SRE, 10Cassandra, 10Shared-Data-Infrastructure, 10User-Eevans: Properly add aqsloader user (w/ secrets) - https://phabricator.wikimedia.org/T305600 (10Eevans) [20:01:31] 10Data-Engineering, 10Cassandra, 10Data Pipelines: Create puppet defined type for adding/updating/deleting secrets or other small files on HDFS - https://phabricator.wikimedia.org/T323692 (10Eevans) 05Open→03Stalled [20:01:35] 10Data-Engineering-Kanban, 10Data-Platform-SRE, 10Cassandra, 10Shared-Data-Infrastructure, 10User-Eevans: Properly add aqsloader user (w/ secrets) - https://phabricator.wikimedia.org/T305600 (10Eevans) [20:05:00] 10Data-Engineering (Sprint 8): [BUG] webrequest analyzer DQ jobs fails to store data - https://phabricator.wikimedia.org/T356401 (10gmodena) > On prod (an-launcher1002, job submitted with user analytics) missing databases are not created. The write operation was failing, but an exception got swallowed and appli... [20:08:39] 10Data-Engineering, 10Data-Platform-SRE, 10Cassandra, 10Pageviews-API, 10User-Elukey: Improve user management for AQS Cassandra - https://phabricator.wikimedia.org/T142073 (10Eevans) [20:10:33] 10Data-Engineering, 10Data-Platform-SRE, 10Cassandra, 10Pageviews-API, 10User-Elukey: Improve user management for AQS Cassandra - https://phabricator.wikimedia.org/T142073 (10Eevans) [20:12:21] 10Data-Engineering, 10Data-Platform-SRE, 10Cassandra, 10Pageviews-API, 10User-Elukey: Improve user management for AQS Cassandra - https://phabricator.wikimedia.org/T142073 (10Eevans) 05Open→03Resolved a:03Eevans The items in-scope for this issue are complete; Closing. 
[20:12:59] Data-Engineering, Cassandra: Audit and update AQS Cassandra roles & grants - https://phabricator.wikimedia.org/T313877 (Eevans)
[20:13:14] Data-Engineering, Cassandra: Audit and update AQS Cassandra roles & grants - https://phabricator.wikimedia.org/T313877 (Eevans) p:Triage→Medium
[20:21:59] Data-Engineering, Cassandra: Audit and update AQS Cassandra roles & grants - https://phabricator.wikimedia.org/T313877 (Eevans)
[20:34:28] Data-Engineering (Sprint 8): [Maintenance] Migrate Gitlab CI to blubber - https://phabricator.wikimedia.org/T356364 (Ahoelzl) a:Antoine_Quhen
[20:34:57] Data-Engineering (Sprint 8): [Maintenance] Migrate Gitlab CI to blubber - https://phabricator.wikimedia.org/T356364 (Ahoelzl) Improvement: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/602
[20:40:49] (SystemdUnitFailed) firing: (12) refinery-sqoop-mediawiki-production-daily.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:11:08] (PS1) Gmodena: IcebergWriter: don't create missing tables if absent [analytics/refinery/source] - https://gerrit.wikimedia.org/r/995101 (https://phabricator.wikimedia.org/T356401)
[21:11:51] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 5.679% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[21:15:27] Data-Engineering (Sprint 8): Add `event.app_donor_experience` fields to event sanitization allowlist - https://phabricator.wikimedia.org/T356214 (SNowick_WMF)
[21:16:05] (KafkaReplicationFactorTooLow) firing: (2) Kafka topic codfw.app_places_interaction replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[21:21:05] (KafkaReplicationFactorTooLow) resolved: (2) Kafka topic codfw.app_places_interaction replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[21:29:39] Data-Engineering, Cassandra, Structured Data Engineering, Structured-Data-Backlog: image suggestions DAG should not use aqsloader Cassandra role - https://phabricator.wikimedia.org/T356446 (Eevans)
[21:37:19] Data-Engineering-Kanban, Data-Platform-SRE, Cassandra, Shared-Data-Infrastructure, User-Eevans: Properly add aqsloader user (w/ secrets) - https://phabricator.wikimedia.org/T305600 (Eevans)
[21:49:51] (PS1) Bearloga: Add app_donor_experience to allowlist [analytics/refinery] - https://gerrit.wikimedia.org/r/995104 (https://phabricator.wikimedia.org/T356214)
[21:52:53] Data-Engineering (Sprint 8), Patch-For-Review: Add `event.app_donor_experience` fields to event sanitization allowlist - https://phabricator.wikimedia.org/T356214 (mpopov) Shay met with me for consultation on this. @lbowmaker: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/995104 is ready for review
[22:41:40] Data-Platform-SRE (2024.01.22 - 2024.02.11), Observability-Alerting: Fix "requests triggering circuit breakers" Elastic alert - https://phabricator.wikimedia.org/T355795 (bking) Per IRC conversation with cwhite: "We usually generate metrics from logs then alert on those metrics. An example can be found...
[22:56:17] Data-Engineering, Cassandra: Audit and update AQS Cassandra roles & grants - https://phabricator.wikimedia.org/T313877 (Eevans)
[22:56:21] Data-Engineering-Kanban, Data-Platform-SRE, Cassandra, Shared-Data-Infrastructure, User-Eevans: Properly add aqsloader user (w/ secrets) - https://phabricator.wikimedia.org/T305600 (Eevans)
[22:57:02] Data-Engineering, Cassandra: Audit and update AQS Cassandra roles & grants - https://phabricator.wikimedia.org/T313877 (Eevans)
[23:01:41] Data-Engineering-Kanban, Data-Platform-SRE, Cassandra, Shared-Data-Infrastructure, User-Eevans: Properly add aqsloader user (w/ secrets) - https://phabricator.wikimedia.org/T305600 (Eevans)
[23:02:33] Data-Engineering-Kanban, Data-Platform-SRE, Cassandra, Shared-Data-Infrastructure, User-Eevans: Properly add aqsloader user (w/ secrets) - https://phabricator.wikimedia.org/T305600 (Eevans)
[23:02:46] Data-Engineering-Kanban, Data-Platform-SRE, Cassandra, Shared-Data-Infrastructure, User-Eevans: Properly add aqsloader user (w/ secrets) - https://phabricator.wikimedia.org/T305600 (Eevans)
[23:18:40] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Change TLS/load balancer configuration for cloudelastic - https://phabricator.wikimedia.org/T355720 (bking) Open→Resolved
[23:18:43] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (bking)
[23:19:02] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Change TLS/load balancer configuration for cloudelastic - https://phabricator.wikimedia.org/T355720 (bking) After the above changes, we were able to add our canary back to LVS. It's passing health checks and receiving traffic, so we should ju...