[00:23:25] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.371% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[01:58:05] PROBLEM - MD RAID on aqs1013 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[04:23:25] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.371% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[06:21:55] Data-Engineering, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 10 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (KartikMistry)
[06:23:32] Data-Engineering, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 10 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (KartikMistry) cxserver finally migrated to Nodejs 18. Please note issues like: https://gerrit.wikimedia.org/r/c/operat...
[08:23:25] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.371% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[08:44:33] Data-Platform-SRE: Set up kubeconfig files for spark-history - https://phabricator.wikimedia.org/T351711 (brouberol) ` brouberol@deploy2002:~$ ls -1 /etc/kubernetes/spark-history* /etc/kubernetes/spark-history-deploy-dse-k8s-eqiad.config /etc/kubernetes/spark-history-dse-k8s-eqiad.config /etc/kubernetes/spa...
[08:44:41] Data-Platform-SRE: Set up kubeconfig files for spark-history - https://phabricator.wikimedia.org/T351711 (brouberol) Open→Resolved
[08:44:43] Data-Platform-SRE: Decide how to handle the spark-history service for the test cluster - https://phabricator.wikimedia.org/T351716 (brouberol)
[08:44:46] Data-Engineering (Sprint 5), Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[08:46:35] * brouberol-afk waves good morning
[09:03:42] io/
[09:13:05] Morning all.
[09:24:49] hi folks!
[09:25:00] I took the liberty to move the AMD GPU page under the ML namespace on wikitech
[09:25:03] https://wikitech.wikimedia.org/wiki/Machine_Learning/AMD_GPU
[09:25:14] there is a redirect from the old Analytics link of course
[09:25:19] (so no broken links)
[09:25:27] Ah great, thanks. We have no more GPUs in the Hadoop cluster, right?
[09:25:42] We do, two (an-worker110[01])
[09:26:05] Ah, thanks. I think I still have to remove the node labels from the other four then.
[09:26:39] super yes.. My team may start to experiment with Airflow and Hadoop + GPUs, but somebody will reach out if we start doing so
[09:26:55] (to offer something to the Research team while we come up with a more elaborate solution)
[09:27:02] (still no idea what it would look like :D)
[09:27:39] btullis: could you point me to the pipeline that rebuilt the spark docker images with the new entrypoint that supports the `history` command? Thanks!
[09:27:51] Nice :-) Are you managing to get the GPUs nice and warm with lift wing?
[09:28:30] brouberol: it's in this repo: https://github.com/wikimedia/operations-docker-images-production-images/blob/master/images/spark/3.1/entrypoint.sh#L94-L104
[09:30:42] thanks! I'm interested in the actual build pipeline, to ultimately get the full image name in our registry. Do you know where I could find that?
[09:31:11] Data-Engineering (Sprint 5), Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[09:31:28] Data-Platform-SRE: Create a helm chart for the spark-history service - https://phabricator.wikimedia.org/T351722 (brouberol) a:brouberol
[09:33:15] brouberol: It uses an in-house tool called docker-pkg : https://github.com/wikimedia/operations-puppet/blob/production/modules/profile/files/docker/manage-production-images.sh
[09:33:41] https://wikitech.wikimedia.org/wiki/Kubernetes/Images#Production_images
[09:33:51] https://gerrit.wikimedia.org/g/operations/docker-images/production-images
[09:34:12] Oops, that last one was a duplicate link.
[09:34:32] I meant to paste this one: https://doc.wikimedia.org/docker-pkg/
[09:39:59] Thanks :)
[09:54:56] Data-Engineering (Sprint 5), Section-Level-Image-Suggestions, Structured-Data-Backlog (Current Work): [S] Coalesce section alignment image suggestions output - https://phabricator.wikimedia.org/T347558 (mfossati) Open→In progress a:mfossati
[10:02:53] Data-Engineering (Sprint 5): [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs - https://phabricator.wikimedia.org/T349532 (CodeReviewBot) aqu merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/537 Add statsd as a dependency to our setup
[10:03:03] Data-Engineering, Release-Engineering-Team, GitLab (CI & Job Runners): Unblock Dockerfile syntax to build images with Gitlab trusted runner - https://phabricator.wikimedia.org/T351792 (CodeReviewBot) aqu merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/537 Add...
[10:03:15] Data-Engineering (Sprint 5), Data-Platform-SRE, Observability-Metrics: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (CodeReviewBot) aqu merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/537 Add statsd as a dependen...
[10:31:04] (PuppetFailure) firing: Puppet has failed on an-tool1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:06:16] Data-Engineering (Sprint 5), Data-Platform-SRE: Create a keytab for each spark-history-server and add it to the puppet secret hieradata - https://phabricator.wikimedia.org/T351816 (brouberol) I investigated option 1, by having a look at how we could leverage Calico to NAT our egress traffic through some...
[11:07:18] btullis: re GPUs (sorry, I've only just seen the question) - so far yes, they work! We are trying to order one https://www.amd.com/en/products/server-accelerators/instinct-mi100 to test it, but so far it has been difficult (Dell etc. don't have it in their offerings)
[11:13:20] elukey: Ack, thanks for the update.
[11:51:04] (PuppetFailure) resolved: Puppet has failed on an-tool1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:23:25] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.371% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[13:16:17] (KafkaReplicationFactorTooLow) firing: (197) SKafka topic replication factor is too low - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[13:21:27] (KafkaReplicationFactorTooLow) resolved: (197) SKafka topic replication factor is too low - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[13:27:01] !log reimage druid1007 to upgrade to bullseye T332589
[13:27:02] Data-Platform-SRE: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host druid1007.eqiad.wmnet with OS bullseye
[13:27:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:27:05] T332589: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589
[13:38:18] (KafkaReplicationFactorTooLow) firing: (372) SKafka topic replication factor is too low - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[13:43:28] (KafkaReplicationFactorTooLow) resolved: (372) SKafka topic replication factor is too low - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[13:56:57] Data-Engineering, Data Products: Use inclusive language in code for private analytics infrastructure - https://phabricator.wikimedia.org/T280268 (lbowmaker)
[14:04:03] Data-Platform-SRE: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host druid1007.eqiad.wmnet with OS bullseye completed: - druid1007 (**WARN**) - Downtimed on Icinga/Aler...
[14:05:27] Data-Platform-SRE: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589 (Stevemunene)
[14:07:50] Data-Platform-SRE: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589 (Stevemunene) `druid100[7-8]` are now running bullseye. As stated `druid100[4-6]` are in the process of being decommissioned T336043 and once that is done the whole druid public cluster will be fully r...
[14:43:18] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: WDQS graph split: load data from dumps into new hosts - https://phabricator.wikimedia.org/T347504 (bking) Looks like the data reload for lexemes completed. @dcausse , are you able to check the data from the reload and make sure it's usable? Let me...
[14:53:39] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1160.eqiad.wmnet with OS bullseye
[14:57:05] Data-Engineering, Data Products: Duplicate keys in x_analytics header corrupt some wmf_raw.webrequest rows and break refinement of wmf.webrequest - https://phabricator.wikimedia.org/T351909 (mforns)
[14:57:09] Data-Engineering (Sprint 5), Data-Platform-SRE: Create a keytab for each spark-history-server and add it to the puppet secret hieradata - https://phabricator.wikimedia.org/T351816 (brouberol) We will still test creating a keytab with an fqdn like `...wmnet` to see whether we can...
[14:57:18] Data-Engineering, Data Products (Data Product Sprint 04): Duplicate keys in x_analytics header corrupt some wmf_raw.webrequest rows and break refinement of wmf.webrequest - https://phabricator.wikimedia.org/T351909 (mforns)
[14:59:28] Data-Engineering (Sprint 5): [Data Quality] Develop Airflow post processing instrumentation to collect and log configurable data metrics - https://phabricator.wikimedia.org/T349763 (gmodena) Quick update on this spike. Right now I have the following running in a dev environment for `webrequest`: - [x] Scala...
[15:02:00] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1157.eqiad.wmnet with OS bullseye
[15:02:57] (CR) Sbisson: [C: +1] Create mediawiki/wiki_highlights_experiment [schemas/event/secondary] - https://gerrit.wikimedia.org/r/975079 (https://phabricator.wikimedia.org/T348613) (owner: Conniecc1)
[15:05:51] !log pool druid1007 after bullseye reimage T332589
[15:05:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:05:54] T332589: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589
[15:09:27] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (Jclark-ctr) @cmooney thanks for looking at it. Previously we were not even getting to the Debian installer; the sre.network.configure-switch-interface was run without e...
[15:32:18] Data-Engineering, Data Products (Data Product Sprint 04): Duplicate keys in x_analytics header corrupt some wmf_raw.webrequest rows and break refinement of wmf.webrequest - https://phabricator.wikimedia.org/T351909 (JAllemandou) Discussed during Data Engineering standup: let's fix with `spark.sql.mapKeyD...
[15:48:27] btullis: just in case you're not aware - some airflow jobs fail on the test cluster
[15:58:04] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (Jclark-ctr) a:Jclark-ctr
[16:02:52] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Patch-For-Review: Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355 (LSobanski) @bking, this change is causing Puppet failures for miscweb1003 because of the existence of duplicate blackbox che...
[16:09:27] Data-Engineering, Data Pipelines, Epic: [Iceberg] Epic: Icebergify event_sanitized database - https://phabricator.wikimedia.org/T311743 (JAllemandou) Relevant slack discussion: https://app.slack.com/client/E012JBDTTHA/CSV483812 We could take advantage of this migration to delete some unused data and...
[16:18:31] Data-Engineering (Sprint 5), Data-Platform-SRE: Add a spark global config for better file commit strategy - https://phabricator.wikimedia.org/T351388 (JAllemandou) It's in my plan to update the docs @mpopov, it just takes longer than I would like (like everything else I do lately).
[16:23:25] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.371% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[16:49:47] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host an-worker1157.eqiad.wmnet with OS bullseye
[17:07:48] Data-Engineering, Observability-Logging, Traffic: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (Milimetric) Besides the great discussion above, I just want to point out some related things. * Varnish captures timestamps in a specific way as part of its loggi...
[17:08:23] joal: thanks for the heads-up. Will check it out.
[18:00:15] Data-Platform-SRE: Reduce impact of Elastic snapshots - https://phabricator.wikimedia.org/T351475 (bking) Just wanted to add that [[ https://phabricator.wikimedia.org/T317616 | Envoy is deployed for Swift frontends ]] per today's SRE meeting. That being said, we (Search Platform/Data Platform SRE) would pr...
[18:27:36] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (Papaul) - first issue on 1157 the serial port address was set to COM1 and not com2 - second issue on 1157 boot order was set to network then disk making the server to ke...
[18:47:09] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Patch-For-Review: Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355 (Dzahn) This should be fixed by just giving it a slightly different name. As is happening in https://gerrit.wikimedia.org/r/c...
[18:52:38] (CR) Jforrester: "check experimental" [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/974243 (https://phabricator.wikimedia.org/T350411) (owner: Lucas Werkmeister (WMDE))
[19:06:10] (PS1) Mforns: Set spark.sql.mapKeyDedupPolicy to LAST_WIN in refine_webrequest [analytics/refinery] - https://gerrit.wikimedia.org/r/977774 (https://phabricator.wikimedia.org/T351909)
[19:13:22] (PS2) Mforns: Quick fix to refine_webrequest_hourly for exclude_row_ids [analytics/refinery] - https://gerrit.wikimedia.org/r/975418
[19:14:24] (CR) Mforns: "Aaargh, messed up including other stuff. Fixing..." [analytics/refinery] - https://gerrit.wikimedia.org/r/975418 (owner: Mforns)
[19:16:46] (PS3) Mforns: Quick fix to refine_webrequest_hourly for exclude_row_ids [analytics/refinery] - https://gerrit.wikimedia.org/r/975418
[19:17:54] (CR) Mforns: Quick fix to refine_webrequest_hourly for exclude_row_ids (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/975418 (owner: Mforns)
[19:18:38] (PS2) Mforns: Set spark.sql.mapKeyDedupPolicy to LAST_WIN in refine_webrequest [analytics/refinery] - https://gerrit.wikimedia.org/r/977774 (https://phabricator.wikimedia.org/T351909)
[19:56:57] (PS7) Conniecc1: Create mediawiki/wiki_highlights_experiment [schemas/event/secondary] - https://gerrit.wikimedia.org/r/975079 (https://phabricator.wikimedia.org/T348613)
[20:23:25] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.371% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[20:52:21] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host an-worker1157.eqiad.wmnet with OS bullseye
[21:03:08] !log deploying airflow-dags to analytics_test instance
[21:03:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:07:53] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host an-worker1157.eqiad.wmnet with OS bullseye executed with errors: - an-worke...
[21:42:09] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin1001 for host an-worker1157.eqiad.wmnet with OS bullseye
[21:56:41] Data-Engineering (Sprint 5), Data-Platform-SRE, Observability-Metrics, Patch-For-Review: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/548...
[22:10:39] (PS8) Conniecc1: Create mediawiki/wiki_highlights_experiment [schemas/event/secondary] - https://gerrit.wikimedia.org/r/975079 (https://phabricator.wikimedia.org/T348613)
[22:11:05] (CR) CI reject: [V: -1] Create mediawiki/wiki_highlights_experiment [schemas/event/secondary] - https://gerrit.wikimedia.org/r/975079 (https://phabricator.wikimedia.org/T348613) (owner: Conniecc1)
[22:14:19] Data-Engineering, Diffusion-Repository-Administrators, Observability-Metrics, Projects-Cleanup, Wikimedia-GitHub: Consider archiving Gerrit repository "operations/software/dropwizard-metrics" (20150219) - https://phabricator.wikimedia.org/T352103 (Aklapper)
[22:17:52] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin1001 for host an-worker1157.eqiad.wmnet with OS bullseye completed: - an-worker1157 (*...
[22:32:24] Data-Platform-SRE: Simplify query.wikidata.org LDF endpoint config - https://phabricator.wikimedia.org/T352111 (bking)
[22:50:45] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355 (bking) Looks like the check targets are rendered at `/srv/prometheus/ops/targets/probes-custom_puppet-http.yaml` on the prom hosts after merging t...
[23:05:07] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355 (bking) The probe is getting a 500 error, which is spawning phab tickets for serviceops-collab team (see T352084 ). As such, I've set a 24-hour sup...