[01:28:08] Data-Engineering, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cephosd2001.codfw.wmnet with OS bullseye
[01:28:11] Data-Engineering, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cephosd2002.codfw.wmnet with OS bullseye
[01:28:15] Data-Engineering, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cephosd2003.codfw.wmnet with OS bullseye
[03:12:14] (DiskSpace) firing: Disk space an-test-ui1001:9100:/ 3.015% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-ui1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[07:12:29] (DiskSpace) firing: Disk space an-test-ui1001:9100:/ 3.075% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-ui1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[08:47:30] Data-Engineering (Sprint 6), Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[08:48:58] Data-Engineering (Sprint 6), Data-Platform-SRE: Configure permissions for the spark-history principal for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (brouberol)
[08:49:31] Data-Engineering (Sprint 6), Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[08:59:13] elukey: would you have a bit of time to talk about our service mesh? I don't know much about istio/envoy, and a bit of guidance would be appreciated (if you do know about it). Thank you!
[09:23:13] Data-Platform-SRE (23/24 Q2 1): [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (Gehel)
[09:23:52] Data-Platform-SRE (23/24 Q2 1): Reduce impact of Elastic snapshots - https://phabricator.wikimedia.org/T351475 (Gehel)
[09:23:57] Data-Platform-SRE (23/24 Q2 1), serviceops-radar, Discovery-Search (Current work), Epic: Determine and control cirrus streaming updater's usage of MWAPI resources - https://phabricator.wikimedia.org/T349848 (Gehel)
[09:23:59] Data-Engineering (Sprint 6), Data-Platform-SRE (23/24 Q2 1), Event-Platform: [Event Platform] Define Flink k8s operator SLO - https://phabricator.wikimedia.org/T345914 (Gehel)
[09:24:03] Data-Platform-SRE (23/24 Q2 1), Release-Engineering-Team, Scap: "scap deploy"'s config-deploy should check for broken symlinks - https://phabricator.wikimedia.org/T342162 (Gehel)
[09:24:05] Data-Platform-SRE (23/24 Q2 1): Upgrade matomo (piwik.wikimedia.org) to latest stable version - https://phabricator.wikimedia.org/T351552 (Gehel)
[09:24:07] Data-Engineering, Data-Platform-SRE (23/24 Q2 1), SRE: Grant IdempotentWrite Kafka Cluster ACL to User:ANONYMOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (Gehel)
[09:24:11] Data-Platform-SRE (23/24 Q2 1): Deploy ceph radosgw processes to data-engineering cluster - https://phabricator.wikimedia.org/T330152 (Gehel)
[09:24:15] Data-Engineering, Data-Platform-SRE (23/24 Q2 1): Deprecate Hue and stop the services - https://phabricator.wikimedia.org/T341895 (Gehel)
[09:24:19] Data-Platform-SRE (23/24 Q2 1): Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0 - https://phabricator.wikimedia.org/T338057 (Gehel)
[09:24:23] Data-Engineering, Data-Platform-SRE (23/24 Q2 1), Discovery-Search (Current work), Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (Gehel)
[09:24:27] Data-Engineering, Data-Platform-SRE (23/24 Q2 1), Foundational Technology Requests: Enable the Marketing Campaigns Reporting plugin for matomo - https://phabricator.wikimedia.org/T319013 (Gehel)
[09:24:31] Data-Platform-SRE (23/24 Q2 1): Check home/HDFS leftovers of ntsako - https://phabricator.wikimedia.org/T343189 (Gehel)
[09:24:35] Data-Platform-SRE (23/24 Q2 1): Check home/HDFS leftovers of andyrussg - https://phabricator.wikimedia.org/T338234 (Gehel)
[09:24:39] Data-Engineering (Sprint 6), Data-Platform-SRE (23/24 Q2 1), Patch-For-Review: [Data Platform] Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (Gehel)
[09:24:43] Data-Engineering (Sprint 6), Data-Platform-SRE (23/24 Q2 1), Observability-Metrics, Patch-For-Review: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (Gehel)
[09:26:25] Data-Platform-SRE (23/24 Q2 1): Cirrus-streaming-updater test: validate relforge indices are correctly updated - https://phabricator.wikimedia.org/T350186 (Gehel)
[09:26:32] Data-Platform-SRE (23/24 Q2 1): Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (Gehel)
[09:27:14] Data-Platform-SRE (23/24 Q2 1): Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639 (Gehel)
[09:27:19] Data-Platform-SRE (23/24 Q2 1): Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (Gehel)
[09:27:22] Data-Platform-SRE (23/24 Q2 1): [airflow] Inserting task notes is not working since upgrade to version 2.7.3 - https://phabricator.wikimedia.org/T352534 (Gehel)
[09:27:24] Data-Platform-SRE (23/24 Q2 1): Alert in need of triage: SmartNotHealthy (instance an-worker1086:9100) - https://phabricator.wikimedia.org/T352168 (Gehel)
[09:27:26] Data-Platform-SRE (23/24 Q2 1): Create a superset container image using the PipelineLib framework - https://phabricator.wikimedia.org/T352165 (Gehel)
[09:27:32] Data-Platform-SRE (23/24 Q2 1): Create a helm chart for the spark-history service - https://phabricator.wikimedia.org/T351722 (Gehel)
[09:27:34] Data-Platform-SRE, serviceops, Discovery-Search (Current work): Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503 (dcausse) I think (please ignore if already done) we're still missing the partition count change on kafka-jumbo for both topics.
[09:27:44] Data-Platform-SRE (23/24 Q2 1): Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045 (Gehel)
[09:27:51] Data-Platform-SRE (23/24 Q2 1): Simplify query.wikidata.org LDF endpoint config - https://phabricator.wikimedia.org/T352111 (Gehel)
[09:27:56] Data-Platform-SRE (23/24 Q2 1): Create dashboards/alerts for new Cirrus Streaming Updater - https://phabricator.wikimedia.org/T349772 (Gehel)
[09:28:01] Data-Platform-SRE (23/24 Q2 1): Refactor sre.wdqs.data-transfer to use new spicerack class api - https://phabricator.wikimedia.org/T347624 (Gehel)
[09:28:16] Data-Platform-SRE (23/24 Q2 1): Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355 (Gehel)
[09:28:20] Data-Platform-SRE (23/24 Q2 1): Check log rotation settings on airflow instances - https://phabricator.wikimedia.org/T339015 (Gehel)
[09:28:27] Data-Platform-SRE (23/24 Q2 1): Examine/refactor WDQS startup scripts - https://phabricator.wikimedia.org/T342361 (Gehel)
[09:28:31] Data-Platform-SRE (23/24 Q2 1): Check home/HDFS leftovers of ryanmax - https://phabricator.wikimedia.org/T325527 (Gehel)
[10:20:41] Data-Engineering (Sprint 6), Data-Platform-SRE: Configure permissions for the spark-history principal for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (brouberol) Turns out the issue isn't with the principal itself, but with rwx permissions in hdfs directly. Running the f...
[10:35:48] Data-Engineering (Sprint 6), Data-Platform-SRE: Configure permissions for the spark-history principal for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (brouberol) If we can confirm that HDFS sees the UID of the spark-history server user (185), then we would probably need...
[10:40:47] Data-Engineering (Sprint 6), Data-Platform-SRE: Configure the spark event dir in the spark3 defaults - https://phabricator.wikimedia.org/T352849 (brouberol)
[10:41:19] Data-Engineering (Sprint 6), Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[10:46:06] Data-Engineering (Sprint 6), Data-Platform-SRE: Configure permissions for the spark-history principal for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (brouberol) A final approach would be to re-use an existing users (such as `analytics (uid=906)` or `analytics-privatedat...
[10:53:01] Data-Engineering (Sprint 6), Data-Platform-SRE: Configure permissions for the spark-history principal for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (brouberol) Picking the last solution, we'd need to make `/var/log/spark` with owner `hdfs` and group owner `analytics-pr...
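The T352838 comments above turn on plain rwx bits in HDFS: which owner and group `/var/log/spark` has, and which user the spark-history process appears as. HDFS follows a POSIX-like model where exactly one permission class is consulted (owner, else group, else other). Here is a minimal Python sketch of that check. The group name, user name and mode are hypothetical placeholders, not the values used on the cluster, and the superuser bypass is left out.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class HdfsEntry:
    """Ownership and mode bits of one path (illustrative model only)."""
    owner: str
    group: str
    mode: int  # octal permission bits, e.g. 0o750


def allowed(entry: HdfsEntry, user: str, groups: set[str], want: str) -> bool:
    """Return True if `user` may perform `want` ("r", "w" or "x") on `entry`.

    Only one class is checked: owner if the user owns the path, otherwise
    group if the user belongs to the path's group, otherwise other.
    """
    bit = {"r": 4, "w": 2, "x": 1}[want]
    if user == entry.owner:
        perms = (entry.mode >> 6) & 7
    elif entry.group in groups:
        perms = (entry.mode >> 3) & 7
    else:
        perms = entry.mode & 7
    return bool(perms & bit)


# Hypothetical values: a directory owned by hdfs, readable by one group.
spark_logs = HdfsEntry(owner="hdfs", group="example-group", mode=0o750)
print(allowed(spark_logs, "history-user", {"example-group"}, "r"))
print(allowed(spark_logs, "history-user", set(), "r"))
```

With these placeholder values, a reader that is a member of the directory's group can list and read it, while a user outside that group cannot. That is why the discussion focuses on which uid and group the container image runs as.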
[10:54:26] Data-Engineering (Sprint 6), Data-Platform-SRE: Build an image for spark-history with user uid=909 - https://phabricator.wikimedia.org/T352850 (brouberol)
[10:56:49] Data-Engineering (Sprint 6), Data-Platform-SRE (23/24 Q2 Milestone 1): Configure permissions for the spark-history principal for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (Gehel)
[10:57:01] Data-Engineering (Sprint 6), Data-Platform-SRE (23/24 Q2 Milestone 1): Configure the spark event dir in the spark3 defaults - https://phabricator.wikimedia.org/T352849 (Gehel)
[10:57:15] Data-Engineering (Sprint 6), Data-Platform-SRE (23/24 Q2 Milestone 1): Build an image for spark-history with user uid=909 - https://phabricator.wikimedia.org/T352850 (Gehel)
[11:14:20] brouberol: o/ sorry I am afk this morning, we can discuss it later on if you want
[11:15:09] tl;dr is that istio is used as a gateway by all clusters, and the Istio service mesh (istio sidecar) only by ML
[11:17:14] (DiskSpace) firing: Disk space an-test-ui1001:9100:/ 3.071% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-ui1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[11:24:49] joal: xcollazo - This puppet change is ready to tell refine to use the new version of refinery-source: https://gerrit.wikimedia.org/r/c/operations/puppet/+/980445
[11:53:30] Elukey: whenever you can, there’s no rush at all. Thank you!
[12:10:04] Data-Engineering, Movement-Insights, Traffic: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (Milimetric) [[ https://github.com/wikimedia/operations-puppet/blob/8a82461c968c7ba44e786ccdbc05f240369a9d57/modules/varnish/templates/analytics.inc.vcl.erb#...
[12:36:00] Data-Engineering (Sprint 6), Data-Platform-SRE (23/24 Q2 Milestone 1), Patch-For-Review: Configure permissions for the spark-history principal for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (CodeReviewBot) brouberol opened https://gitlab.wikimedia.org/repos/data-eng...
[12:52:49] Data-Engineering (Sprint 6), Data-Platform-SRE (23/24 Q2 Milestone 1), Patch-For-Review: Configure permissions for the spark-history principal for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (brouberol) For the test hadoop cluster: ` brouberol@an-test-master1001:~$ s...
[12:54:27] Data-Engineering (Sprint 6), Data-Platform-SRE (23/24 Q2 Milestone 1), Patch-For-Review: Configure permissions for the spark-history principal for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (brouberol) When I executed the previous commands, I immediately started see...
[12:59:38] Data-Engineering (Sprint 6), Data-Platform-SRE: Define the spark history service helmfile deployments - https://phabricator.wikimedia.org/T352860 (brouberol)
[13:01:25] Data-Engineering (Sprint 6), Data-Platform-SRE: Deploy the spark history services - https://phabricator.wikimedia.org/T352861 (brouberol)
[13:02:52] Data-Engineering (Sprint 6), Data-Platform-SRE: Configure the YARN resource manager with the spark history service URL - https://phabricator.wikimedia.org/T352863 (brouberol)
[13:03:53] Data-Engineering (Sprint 6), Data-Platform-SRE (23/24 Q2 Milestone 1): [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[13:04:09] Data-Engineering (Sprint 6), Data-Platform-SRE (23/24 Q2 Milestone 1): Configure the YARN resource manager with the spark history service URL - https://phabricator.wikimedia.org/T352863 (brouberol)
[13:04:19] Data-Engineering (Sprint 6), Data-Platform-SRE (23/24 Q2 Milestone 1): Define the spark history service helmfile deployments - https://phabricator.wikimedia.org/T352860 (brouberol)
[13:04:42] Data-Engineering (Sprint 6), Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[13:05:35] Data-Engineering (Sprint 6), Data-Platform-SRE (23/24 Q2 Milestone 1): Deploy the spark history services - https://phabricator.wikimedia.org/T352861 (brouberol)
[13:08:17] Data-Engineering (Sprint 6), Data-Platform-SRE (23/24 Q2 Milestone 1), Patch-For-Review: Build an image for spark-history with user uid=909 - https://phabricator.wikimedia.org/T352850 (brouberol)
[13:08:28] Data-Engineering (Sprint 6), Data-Platform-SRE (23/24 Q2 Milestone 1), Patch-For-Review: Build an image for spark-history with user uid=909 - https://phabricator.wikimedia.org/T352850 (brouberol) Change request: https://gitlab.wikimedia.org/repos/data-engineering/spark/-/merge_requests/8
[13:09:05] Data-Engineering (Sprint 6), Patch-For-Review: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs - https://phabricator.wikimedia.org/T349532 (CodeReviewBot) aqu merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/549 Provide Airflow metrics develo...
[13:14:33] Data-Platform-SRE: Bring dbstore1009 into service to replace dbstore1005 - https://phabricator.wikimedia.org/T351924 (Gehel) p:Medium→High
[13:14:49] Data-Platform-SRE, Discovery-Search (Current work): Investigate performance differences between wdqs2022 and older hosts - https://phabricator.wikimedia.org/T336443 (Gehel) p:Triage→Medium
[13:27:36] Data-Engineering, Data Products, Wikidata, Wikidata-Query-Service: Publish WDQS JNL files to dumps.wikimedia.org - https://phabricator.wikimedia.org/T344905 (Gehel) This needs to be driven by a product need. Removing DPE-SRE from this ticket until things are moving and we are needed.
[13:29:57] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, [DEPRECATED] wdwb-tech, and 2 others: Automatically depool wdqs servers that are "lagged" - https://phabricator.wikimedia.org/T270614 (Gehel) p:High→Medium
[13:33:26] Data-Platform-SRE, observability, Epic: Change data platform-related IRC channels to improve communication - https://phabricator.wikimedia.org/T352783 (Gehel) p:Triage→Medium
[13:33:30] Data-Platform-SRE: Decommission dbstore1005 - https://phabricator.wikimedia.org/T351925 (Gehel) p:Triage→High
[13:33:54] Data-Platform-SRE: Decommission dbstore1003 - https://phabricator.wikimedia.org/T351923 (Gehel) p:Triage→High
[13:34:33] Data-Platform-SRE, Discovery-Search, Elasticsearch: cleanup the custom elasticsearch_${version}@ systemd unit in favor of an override configuration - https://phabricator.wikimedia.org/T218315 (Gehel) p:High→Medium
[13:36:33] Data-Platform-SRE: Check home/HDFS leftovers of jbond - https://phabricator.wikimedia.org/T352511 (Gehel) p:Triage→Low
[13:40:22] brouberol: back!
[13:40:29] so longer explanation :)
[13:41:29] Historic background: the serviceops team was the first one to create services on k8s, and at the time they decided not to use a service mesh solution like istio for pod->service (egress) communications
[13:42:23] instead they use an envoy proxy that is configured to listen on a pre-defined set of ports (the canonical list is in puppet, that is also available on k8s via helmfile configs on deploy20020
[13:42:27] )
[13:43:36] so, for example, you configure your service to be able to contact the mw-api, and the serviceops' scaffolding/modules in deployment-charts take care of adding the envoy sidecar with the config for the mw-api endpoint, network rules, etc..
[13:43:56] but the application running in the pod needs to explicitly use the proxy
[13:44:16] for example, stuff like: https://localhost:6500/w/api.php etc..
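To make "the application needs to explicitly use the proxy" concrete, here is a minimal Python sketch of a client that addresses the local envoy listener instead of the remote service. Only the port 6500 and the `/w/api.php` path come from the chat. The `Host` header and its value are assumptions for illustration; whether a listener needs one depends on its configuration. The request is built but not sent.

```python
from typing import Optional
from urllib.request import Request


def local_proxy_request(port: int, path: str, host: Optional[str] = None) -> Request:
    """Build (without sending) a request aimed at a local envoy listener.

    The application talks to localhost:<port>; the sidecar is then
    responsible for reaching the real upstream service.
    """
    req = Request(f"https://localhost:{port}{path}")
    if host is not None:
        # Assumption: some upstreams may route on the Host header.
        req.add_header("Host", host)
    return req


# Port and path as quoted in the chat; the host name is a placeholder.
req = local_proxy_request(6500, "/w/api.php", host="wiki.example.org")
print(req.full_url)
print(req.get_header("Host"))
```

The point of the design is that the application only ever knows a local port, so the same client code works on k8s pods and on bare-metal hosts running the same proxy.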
[13:44:51] and you get a lot of metrics by default, like https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1
[13:45:17] the same scheme is used for "bare metal" services/nodes, for example on mwXXXX hosts there is the same envoy proxy
[13:45:31] and its metrics end up in https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1
[13:45:59] it was a way to have a consistent local proxy on bare metal and k8s
[13:46:02] and it works really weel
[13:46:03] *well
[13:46:18] nice, that makes a lot of sense
[13:46:28] more specifically, in deployment-charts' modules you can find "mesh"
[13:46:41] that takes care of setting up what I wrote above
[13:46:43] yep, I've been playing with it since this morning
[13:46:51] super, and do you know https://gitlab.wikimedia.org/repos/sre/sextant ?
[13:46:51] brouberol: a few questions from Olja on T300102. You're probably the best person for a quick answer
[13:46:52] T300102: Upgrade Kafka to 2.x or 3.x - https://phabricator.wikimedia.org/T300102
[13:47:22] elukey: my understanding is that sextant is used by ./create_new_service.sh, right?
[13:47:46] Which I ran to create the service defined in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/978629 (WIP)
[13:47:49] brouberol: exactly, but you can also use it when you want to upgrade module's versions in a chart that already exists
[13:47:53] gehel: 👀
[13:48:10] okok perfect :)
[13:48:16] oh ok, so it's both a scaffolding tool and our chart package manager?
[13:48:28] basically yes
[13:48:39] but there is another bit of the story, so you have the full picture
[13:48:48] up to this point, we weren't using Istio at all
[13:49:15] then the ML team entered the k8s world with KServe, which uses Istio behind the scenes (Gateway and sidecar)
[13:49:44] long story short, on ml-serve we do use the Istio sidecar proxy, that is basically envoy but the application is not aware of it
[13:50:14] there is some horror in iptables (created by a specific istio container) that "transparently" proxies to localhost
[13:50:27] (of course there is)
[13:50:40] it is not that easy to use and configure, so we don't recommend using it elsewhere
[13:50:44] and it is only available on ml-serve
[13:50:51] noted, thanks
[13:51:06] Last bit (I promise) - both serviceops and ml use Istio as "gateway" though
[13:51:15] since we both needed a shared ingress solution
[13:51:27] the module "ingress" in deployment-charts takes care of exactly that
[13:51:28] Data-Platform-SRE: Try to improve the LDAP integration for Superset user account creation - https://phabricator.wikimedia.org/T297120 (Gehel)
[13:51:31] Data-Engineering, Data-Platform-SRE, Epic: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710 (Gehel)
[13:51:36] right, and this is what most interests me (as this is what caused me issues yesterday
[13:51:41] )
[13:52:08] I saw the issue, IIRC you need to have at least one service configured to be able to pass health checks
[13:52:14] exactly
[13:52:22] that's what I discovered as well
[13:52:23] (if you configured the 404 probe as the other endpoints)
[13:53:02] and with ingress.enabled: true in our chart, the 30443 nodeport is now open
[13:53:12] so should I redeploy the LVS server conf, pybal should be happy now
[13:54:02] now, the thing that I'm not 100% sure of is: can I use ingress in my chart without relying on the service mesh. This is where things get confusing for me
[13:54:12] yes definitely
[13:54:14] Data-Engineering, Data-Services, cloud-services-team, Epic: Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema - https://phabricator.wikimedia.org/T215858 (Gehel) Removing DPE SRE from this task until it is picked up by #data-engine...
[13:54:31] when you create the chart, you need to state the modules that you want
[13:54:31] aka: can I only rely on istio to get traffic to my pod, and not care about reaching out to other services via envoy
[13:54:57] Data-Engineering, Data-Engineering-Jupyter, Product-Analytics: Functionality to share & view notebooks - https://phabricator.wikimedia.org/T156934 (Gehel) Removing DPE SRE until this is picked up by #data-engineering
[13:54:59] in theory yes, but you'll need to create ad-hoc network policies etc..
[13:55:16] but for non-http traffic it makes sense
[13:55:23] just avoid the "mesh" module, and only use "ingress"
[13:55:41] without "mesh" you don't get the envoy sidecar
[13:55:56] Data-Engineering, Research-Freezer: [Open question] Improve bot identification at scale - https://phabricator.wikimedia.org/T138207 (Gehel) Removing DPE SRE until there is more clarity on what to do and if we need to be involved.
[13:56:41] basically with ingress you configure the traffic between the istio gw pods on wikikube and your app
[13:56:54] "mesh" takes care of your-app -> services
[13:57:02] if you don't want it, you can discard it
[13:57:09] Hi btullis - I have seen you're investigating the issue with mediawiki-load airflow job - I kinda know the structure of this stuff, do you want us to spend a minute together?
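The ingress-versus-mesh split explained above can be summarised as a small lookup: "ingress" covers traffic from the istio gateway pods to the app, "mesh" covers the app's outbound calls through the envoy sidecar. This Python sketch only models that description. The function and its strings are illustrative and do not correspond to any real deployment-charts API.

```python
def traffic_paths(modules: set) -> list:
    """Describe which traffic directions a chart gets from its modules.

    Purely a model of the chat explanation: module names are the two
    mentioned there ("ingress" and "mesh"); everything else is made up.
    """
    paths = []
    if "ingress" in modules:
        paths.append("istio gateway -> app (inbound)")
    if "mesh" in modules:
        paths.append("app -> envoy sidecar -> other services (outbound)")
    return paths


# A chart that only needs to receive traffic can skip "mesh" entirely.
print(traffic_paths({"ingress"}))
print(traffic_paths({"ingress", "mesh"}))
```

As noted in the chat, a chart without "mesh" gets no envoy sidecar, so any outbound access it does need has to be granted through ad-hoc network policies.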
[13:57:09] ok cool
[13:57:10] maybe leave a note in the chart's README
[13:57:15] so people are aware
[14:00:09] Data-Engineering, Research-Freezer, WMF-Legal, User-Elukey: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (Gehel) Removing DPE SRE until there is a clear direction and we are needed for the implementation.
[14:00:10] I was really struggling with differentiating both notions (ingress vs mesh) because both rely on envoy at the core
[14:00:53] but I can make the whole "nodeport 30443 --> my app port" proxying work without any mesh enabled, so that clears things up
[14:01:59] super
[14:02:14] if you need other info ping me, otherwise you can drop a line in #wikimedia-k8s-sig
[14:02:28] joal: If you have the time, yes please. Batcave?
[14:02:37] btullis: OMW!
[14:04:06] Data-Platform-SRE: Archive /home/ezachte data on stat1007 - https://phabricator.wikimedia.org/T238243 (Gehel) p:Medium→Low
[14:10:12] elukey: thanks so muc
[14:10:24] *much, this is really making things clearer for me
[14:16:01] Data-Engineering, Data-Platform-SRE, SRE, observability, and 3 others: Upgrade Kafka to 2.x or 3.x - https://phabricator.wikimedia.org/T300102 (brouberol) > While we ought to consider an upgrade for all 4 clusters, from what I understand Jumbo can be upgraded independently. Are there any concerns...
[14:16:55] !log killed a stalled sqoop process on an-launcher1002
[14:16:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:17:27] btullis if you have about a minute, could you have a look at https://gitlab.wikimedia.org/repos/data-engineering/spark/-/merge_requests/8? If that looks good, I should be able to publish the image, use it for the spark-history server and make sure the HDFS permissions are up to snuff. Thanks!
[14:17:54] hello! I noticed that "PROBLEM - Mediawiki CirrusSearch Saneitizer Weekly Fix Rate" just alerted - it lines up with me migrating the cirrusSearchCheckerJob to the kubernetes jobrunners and I was wondering if there's a way that I could debug what I might have broken?
[14:18:05] and also if it's important enough that I should roll back the job asap
[14:18:21] dcausse: ^
[14:18:27] !log killed a stalled sqoop process on an-launcher1002
[14:18:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:18:29] hnowlan: looking
[14:18:31] inflatador: ^^^
[14:18:49] rolling back is very low impact fwiw
[14:19:35] Data-Engineering (Sprint 6), Data-Platform-SRE (23/24 Q2 Milestone 1), Patch-For-Review: Configure appropriate permissions for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (brouberol)
[14:19:59] hnowlan: saneitizer is low impact anyway, it checks through all wikis to see if anything has been missed in indexing, over an 8 weeks period
[14:20:49] hnowlan: when did you switch this job to k8s?
[14:21:30] dcausse: around 1700 yesterday
[14:22:16] run duration and backlog length have increased a bit since https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&refresh=5m&var-dc=codfw%20prometheus%2Fk8s&var-job=cirrusSearchCheckerJob
[14:23:01] from what I can see from the alert graph the job runs for longer with a lower qps but I don't know if that's hiding some failure case
[14:24:19] it does seem to fix only cloudelastic (check cloudelastic.fixed) tho... (https://grafana-rw.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=1699280584910&to=1701872584910&viewPanel=35) which lines up with your deploy as well
[14:25:14] and we're running a test on cloudelastic so might be unrelated to what you did, but still seems suspicious that the times perfectly match...
[14:25:31] hnowlan: I think we can keep it as is, to give us time to understand what's going on there
[14:30:14] ohh I hadn't noticed that it was specific to cloudelastic.fixed. I'll see if there's any outlier config related to it
[14:30:57] dcausse: okay, cool - thanks. Keep me posted, I'll try to keep looking into it. lemme know if I can change anything
[14:31:38] Data-Engineering, Data-Platform-SRE, SRE, observability, and 3 others: Upgrade Kafka to 2.x or 3.x - https://phabricator.wikimedia.org/T300102 (elukey) >>! In T300102#9386753, @brouberol wrote: > >> Specifically are there clients that publish to Kafka Jumbo directly or do all Kafka topics get mi...
[14:32:20] brouberol: added some comments to --^ lemme know if it makes sense (cc: btullis too)
[14:33:19] I was just reading, thanks for the clarification and added context 👍
[14:33:42] I can't think about other big use cases
[14:40:36] Data-Engineering, Data-Platform-SRE (23/24 Q2 Milestone 1), SRE: Grant IdempotentWrite Kafka Cluster ACL to User:ANONYMOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (herron) Thank you for looking into this @brouberol. Yes I think you are right about the anonymous ACLs. I thi...
[14:49:58] Data-Platform-SRE (23/24 Q2 Milestone 1): Reduce impact of Elastic snapshots - https://phabricator.wikimedia.org/T351475 (bking) Update: Traffic team [[ https://gerrit.wikimedia.org/r/c/operations/alerts/+/980871 | merged a patch that makes these LVS high RX alerts non-paging ]] . Thus, I believe we don't be...
[14:50:23] Data-Platform-SRE (23/24 Q2 Milestone 1): Reduce impact of Elastic snapshots - https://phabricator.wikimedia.org/T351475 (bking) Open→Resolved a:bking
[15:02:22] brouberol: I will check asap. I'm a little discombobulated by a big Ops Week investigation, but will get onto the spark history image MR asap.
[15:03:51] Data-Platform-SRE: Consider implementing Envoy for Elastic hosts - https://phabricator.wikimedia.org/T352872 (bking)
[15:05:33] no worries at all
[15:05:35] thank you
[15:17:29] (DiskSpace) firing: Disk space an-test-ui1001:9100:/ 3.068% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-ui1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[15:31:52] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: Identify/complete post-migration tasks after rdf-streaming-updater migrates to flink operator - https://phabricator.wikimedia.org/T350784 (bking)
[15:33:09] Data-Engineering, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cephosd2001.codfw.wmnet with OS bullseye
[15:33:14] Data-Engineering, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cephosd2001.codfw.wmnet with OS bullseye executed with errors: - ce...
[15:43:34] Data-Platform-SRE: Troubleshoot recurring systemd unit failures for wdqs1022-24 - https://phabricator.wikimedia.org/T352878 (bking)
[15:53:18] Data-Engineering, Data-Platform-SRE: Update the sqoop configuration for mediawiki to obtain linktarget from the production replicas, instead of wikireplicas - https://phabricator.wikimedia.org/T352879 (BTullis)
[15:53:58] Data-Engineering, Data-Platform-SRE: Update the sqoop configuration for mediawiki to obtain linktarget from the production replicas, instead of wikireplicas - https://phabricator.wikimedia.org/T352879 (BTullis) p:Triage→High
[16:31:50] Data-Engineering, Data-Platform-SRE: Update the sqoop configuration for mediawiki to obtain linktarget from the production replicas, instead of wikireplicas - https://phabricator.wikimedia.org/T352879 (Milimetric) Sqooping from the production replicas would mean applying the same sanitization rules on ou...
[16:32:06] Amir1: if you're around, the querypage job is blocked on this ^
[16:32:28] I'm around, let me take a look
[16:33:33] Data-Platform-SRE, observability, Epic, Patch-For-Review: Change data platform-related IRC channels to improve communication - https://phabricator.wikimedia.org/T352783 (bking)
[16:33:57] milimetric: it's for PII protection, imagine someone adds [[Real name of User:Foo is Foo Bar]] to wikis, now we have a linktarget row with that text
[16:34:09] so if removed, then in wikireplicas, it won't show up
[16:35:06] but also, does the analytics sqoop need only absolutely public info? Isn't the whole thing private?
[16:35:24] Data-Engineering (Sprint 6), Data-Platform-SRE: Run a spark job in test to make sure the history server can see the job data - https://phabricator.wikimedia.org/T352882 (brouberol)
[16:36:44] Data-Engineering (Sprint 6), Data-Platform-SRE (23/24 Q2 Milestone 1): Run a spark job in test to make sure the history server can see the job data - https://phabricator.wikimedia.org/T352882 (brouberol)
[16:37:36] Data-Engineering (Sprint 6), Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[16:38:35] Data-Engineering, Data-Platform-SRE: Update the sqoop configuration for mediawiki to obtain linktarget from the production replicas, instead of wikireplicas - https://phabricator.wikimedia.org/T352879 (Ladsgroup) Repeating from IRC for the sake of documentation: > it's for PII protection, imagine someon...
[16:40:56] Amir1: could we instead just delete that linktarget row when the source of it is removed from wikitext?
[16:41:18] the analytics sqoop output is put into two folders, one private, one public
[16:41:40] we could move this to the private folder, that'd be fine, we'd have to move some things around
[16:41:41] it might be used in another table, there is a maint script that does that regularly
[16:41:48] but I think it's not wired to run automatically yet
[16:41:51] but we shouldn't put it in the public folder if it's possibly not
[16:42:16] ah, the garbage collection problem, ok
[16:42:26] yup, fun aspect of normalization
[16:43:13] how hard is it to move it to private?
[16:43:21] and then make it read from production
[16:43:41] also, you keep snapshot of every month, that's technically already there I think :D
[16:49:21] Data-Engineering, Data-Platform-SRE: Update the sqoop configuration for mediawiki to obtain linktarget from the production replicas, instead of wikireplicas - https://phabricator.wikimedia.org/T352879 (Ladsgroup) Alternatively, just move it to the private one and read from production.
[17:19:21] !log deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/979118 for airflow metrics update to airflow_test instance for T349532
[17:19:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:19:24] T349532: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs - https://phabricator.wikimedia.org/T349532
[17:23:21] Data-Engineering, Movement-Insights, Traffic, Patch-For-Review: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (elukey) I created a tmux session on cp4038 with the following: ` sudo varnishlog -n frontend -q 'ReqHeader:Sec-Purpose eq "prefetch;...
[17:26:26] Data-Engineering, Movement-Insights, Traffic, Patch-For-Review: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (elukey) Confirmed, I found a request with: ` - ReqHeader sec-purpose: prefetch;anonymous-client-ip ` So it is a good confirma...
[17:34:46] joal: I have a code change ready if you want to proceed with the extra json field :D
[17:36:43] I think we'll try to make that appear in x_analytics elukey, I'll try to understand code :)
[17:36:53] Data-Engineering, Movement-Insights, Traffic, Patch-For-Review: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (elukey) If the final decision is to proceed with a new field in Webrequest, https://gerrit.wikimedia.org/r/c/operations/puppet/+/98091...
[17:37:06] joal: ack if you want something different the patch is ready
[17:37:23] I also verified that we do receive the header etc..
[17:37:23] Thanks so much elukey :)
[17:37:26] np!
[18:08:10] Data-Platform-SRE (23/24 Q2 Milestone 1): Check home/HDFS leftovers of ryanmax - https://phabricator.wikimedia.org/T325527 (BTullis) Open→Resolved Thanks all for your input. I have now removed the files with: ` sudo cumin 'C:profile::analytics::cluster::client or C:profile::hadoop::master or C:profil...
[18:09:50] Data-Platform-SRE: Consider implementing Envoy for Elastic hosts - https://phabricator.wikimedia.org/T352872 (bking) Looks like we started working on envoy for Elastic in [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/838182 | this patch from late 2022 ]] . See also T143553 .
[18:14:34] hnowlan: we suspect some firewall/egress issues between these new jobrunners and cloudelastic
[18:18:34] Data-Engineering, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cephosd2001.codfw.wmnet with OS bullseye executed with errors: - ceph...
[18:26:45] Data-Platform-SRE (23/24 Q2 Milestone 1): Alert in need of triage: SmartNotHealthy (instance an-worker1086:9100) - https://phabricator.wikimedia.org/T352168 (BTullis) The disk has been replaced. Now following procedures outlined here: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Hadoo...
[18:27:03] !log restarted hadoop-yarn-nodemanager and hadoop-hdfs-datanode services on an-worker1086 for T352168
[18:27:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:27:06] T352168: Alert in need of triage: SmartNotHealthy (instance an-worker1086:9100) - https://phabricator.wikimedia.org/T352168
[18:28:08] Data-Platform-SRE (23/24 Q2 Milestone 1): Alert in need of triage: SmartNotHealthy (instance an-worker1086:9100) - https://phabricator.wikimedia.org/T352168 (BTullis) Open→Resolved
[18:32:18] Data-Engineering (Sprint 6), Data-Platform-SRE (23/24 Q2 Milestone 1), Observability-Metrics, Patch-For-Review: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (BTullis) I think that we might be able to call this part done, as far as the SRE side is con...
[18:55:29] Data-Platform-SRE: Troubleshoot recurring systemd unit failures for wdqs1022-24 - https://phabricator.wikimedia.org/T352878 (ops-monitoring-bot) Host rebooted by bking@cumin2002 with reason: None
[18:55:54] Data-Platform-SRE: Troubleshoot recurring systemd unit failures for wdqs1022-24 - https://phabricator.wikimedia.org/T352878 (bking)
[18:56:16] Data-Platform-SRE (23/24 Q2 Milestone 1), Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/merge_requests/39 Update the vers...
[18:56:24] Data-Platform-SRE (23/24 Q2 Milestone 1), Patch-For-Review: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045 (CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/merge_requests/39 Update the version of wmfdata-python used in...
[19:03:14] Data-Engineering, Movement-Insights, Traffic, Patch-For-Review: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (dr0ptp4kt) I like where @elukey is going with this. I saw this ticket and thought I'd share some perspective. It may be this has all...
[19:17:29] (DiskSpace) firing: Disk space an-test-ui1001:9100:/ 3.065% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-ui1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[19:19:18] (CR) Sharvaniharan: [C: +1] "Looks good to me" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/974674 (https://phabricator.wikimedia.org/T351292) (owner: Clare Ming)
[19:40:51] (CR) Cooltey: [C: +1] "looks good to me" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/974674 (https://phabricator.wikimedia.org/T351292) (owner: Clare Ming)
[19:42:34] dcausse: sounds about right, they're probably a bit stricter. That said, they should be using the same service mesh ports/configs as the appservers. Would the old jobrunners be directly connecting to the cloudelastic hosts in some cases?
[19:51:30] Data-Engineering (Sprint 6): [Data Quality] Move MetricsExporter to refinery-spark - https://phabricator.wikimedia.org/T352688 (gmodena) a:gmodena
[20:05:20] Data-Engineering (Sprint 6): [Data Quality] Move MetricsExporter to refinery-spark - https://phabricator.wikimedia.org/T352688 (gmodena) This class should become the API boundary for pipeline implementers and the Data Quality Metrics table. In the example below I renamed the `MetricsExporter` to `DeequAnaly...
[20:19:10] (PS1) Milimetric: Sanitize the linktarget sqoop [analytics/refinery] - https://gerrit.wikimedia.org/r/980937 (https://phabricator.wikimedia.org/T352879)
[20:35:30] Data-Engineering, Data-Platform-SRE, Patch-For-Review: Update the sqoop configuration for mediawiki to obtain linktarget from the production replicas, instead of wikireplicas - https://phabricator.wikimedia.org/T352879 (CodeReviewBot) milimetric opened https://gitlab.wikimedia.org/repos/data-engineer...
[20:36:17] (CR) Sharvaniharan: [C: +2] "Merging this.. thank you @cjming :-)" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/974674 (https://phabricator.wikimedia.org/T351292) (owner: Clare Ming)
[20:38:04] (Merged) jenkins-bot: Add custom schemas for 2 Android article instruments [schemas/event/secondary] - https://gerrit.wikimedia.org/r/974674 (https://phabricator.wikimedia.org/T351292) (owner: Clare Ming)
[20:53:52] Data-Engineering, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (Papaul) @Jhancock.wm this is failing because site.pp and apt_repo.yaml still have the old host names
[20:56:29] Data-Engineering (Sprint 6): [Data Quality] Adopt iceberg as the data quality metrics table backend - https://phabricator.wikimedia.org/T352687 (gmodena) f/up from a slack thread with @JAllemandou of how to partition the metrics table. Modulo naming adjustments, the following should be a good starting point:...
[21:33:38] Data-Engineering, Movement-Insights: Canonical-data ownership, definition and update - https://phabricator.wikimedia.org/T339928 (JAnstee_WMF) p:Triage→High
[21:35:04] Data-Engineering, Movement-Insights: Canonical-data ownership, definition and update - https://phabricator.wikimedia.org/T339928 (JAnstee_WMF) p:High→Medium
[21:54:38] (PS1) Milimetric: Explain hidden country data [analytics/wikistats2] - https://gerrit.wikimedia.org/r/980949 (https://phabricator.wikimedia.org/T333716)
[21:59:43] (PS1) Milimetric: Release 2.10.3 [analytics/wikistats2] - https://gerrit.wikimedia.org/r/980952
[22:00:08] (CR) Milimetric: [V: +2 C: +2] Release 2.10.3 [analytics/wikistats2] - https://gerrit.wikimedia.org/r/980952 (owner: Milimetric)
[22:00:23] (CR) Milimetric: [C: +2] Explain hidden country data [analytics/wikistats2] - https://gerrit.wikimedia.org/r/980949 (https://phabricator.wikimedia.org/T333716) (owner: Milimetric)
[22:05:46] Data-Engineering, Data-Engineering-Wikistats, Patch-For-Review, Russian-Sites: "Active editors by country" doesn't display numbers for Belarus, Kazakhstan, Russia - https://phabricator.wikimedia.org/T333716 (Milimetric) The above patches do what I suggested in a comment on the talk page: https://...
[22:44:14] Data-Platform-SRE, Wikidata, Epic: Review alerts for miscweb-hosted domains commons-query.wikimedia.org and query.wikidata.org - https://phabricator.wikimedia.org/T352921 (bking)
[22:45:39] Data-Platform-SRE, Wikidata, Epic: Review alerts for miscweb-hosted domains commons-query.wikimedia.org and query.wikidata.org - https://phabricator.wikimedia.org/T352921 (bking)
[22:47:14] Data-Platform-SRE, Wikidata, collaboration-services, Epic: Review alerts for miscweb-hosted domains commons-query.wikimedia.org and query.wikidata.org - https://phabricator.wikimedia.org/T352921 (Dzahn)
[23:17:29] (DiskSpace) firing: Disk space an-test-ui1001:9100:/ 3.061% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-ui1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace