[03:40:14] (PuppetFailure) firing: Puppet has failed on an-coord1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:40:14] (PuppetFailure) firing: Puppet has failed on an-coord1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:17:14] Good morning team - btullis, let me know when you have time to talk about the airflow-dags CI issue please :)
[08:29:40] !log depool druid10[04-06] T336043
[08:29:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:29:45] T336043: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043
[08:37:02] Data-Engineering, MediaWiki-Vendor, PHP 8.2 support, Upstream: Use of "self" in callables is deprecated in php8.2 from liuggio/statsd-php-client package - https://phabricator.wikimedia.org/T326386 (JAllemandou) Thanks for the ping @Jdforrester-WMF. Data-engineering has not been using `statsd` as...
[08:39:38] set druid100[4-6] in decommissioning mode from the coordinators UI T336043
[08:39:39] T336043: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043
[08:41:44] Data-Platform-SRE: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (Stevemunene) The hosts druid100[4-6] have been depooled and set into decommissioning mode {F41562197}
[08:42:08] Data-Platform-SRE: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (Stevemunene)
[09:03:32] (CR) Joal: [C: +1] "One typo, one idea, and one question 😊 This is mostly good to go IMO." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/979406 (https://phabricator.wikimedia.org/T349121) (owner: Xcollazo)
[09:13:40] volans: Hi!
Could I trouble you with a quick re-review of https://gerrit.wikimedia.org/r/c/operations/dns/+/979891/4 when you have a minute? I've addressed your feedback. Thanks!
[09:14:18] brouberol: sure, in a few
[09:14:31] thank you
[09:16:43] {done}
[09:16:51] for the part that I can comment on
[09:18:14] (PS2) Phuedx: Add readme to product_metrics schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/979407 (owner: Clare Ming)
[09:18:25] (CR) Phuedx: Add readme to product_metrics schemas (3 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/979407 (owner: Clare Ming)
[09:24:45] (CR) Phuedx: [C: +2] "I was bold and made three trivial changes. This LGTM!" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/979407 (owner: Clare Ming)
[09:25:17] (Merged) jenkins-bot: Add readme to product_metrics schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/979407 (owner: Clare Ming)
[09:28:40] joal: Sure will. I have a couple of meetings, so in about an hour from now?
[09:28:48] sounds good :)
[09:33:17] thank you volans! Just to be extra sure, as this is my first time deploying DNS changes: is this still the right procedure https://wikitech.wikimedia.org/wiki/DNS#Changing_records_in_a_zonefile ?
[09:35:02] wow those docs are really not user friendly
[09:35:16] they explain what's happening underneath
[09:35:19] as a user:
[09:35:53] just login on any authoritative dns host (e.g. dns1004.wikimedia.org)
[09:36:10] run: sudo authdns-update
[09:36:38] EOL
[09:36:46] 👍 superb, thanks
[09:38:27] brouberol: sorry I think sudo -i might be needed
[09:38:50] I always get it from history but the hosts have been replaced recently, I lost my bash history :D
[09:39:07] but yeah sudo -i is the right way
[09:39:17] It seems to have done fine without
[09:39:27] > OK - authdns-update successful on all nodes!
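A minimal sketch of how the result of `authdns-update` might be double-checked from a workstation, assuming the change is visible on the three authoritative nameservers ns0, ns1 and ns2.wikimedia.org (the names used in the shell loop quoted in T352639). The record name `example.wikimedia.org`, the helper names `dig_commands` and `check_record`, and the exact `dig` arguments after `+short` are assumptions for illustration; the log truncates the original command.

```python
import shutil
import subprocess

# Authoritative nameservers named in the shell loop quoted in the log.
NAMESERVERS = [f"ns{i}.wikimedia.org" for i in range(3)]


def dig_commands(record: str) -> list[list[str]]:
    """Build one `dig +short` invocation per authoritative nameserver.

    The argument layout is an assumption, not the command from the log.
    """
    return [["dig", "+short", f"@{ns}", record] for ns in NAMESERVERS]


def check_record(record: str) -> dict[str, str]:
    """Run each query and map nameserver -> answer.

    Requires the `dig` binary and network access.
    """
    if shutil.which("dig") is None:
        raise RuntimeError("dig is not installed")
    answers = {}
    for ns, cmd in zip(NAMESERVERS, dig_commands(record)):
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=10, check=False
        )
        answers[ns] = result.stdout.strip()
    return answers


if __name__ == "__main__":
    # Hypothetical record name; substitute the record that was changed.
    for cmd in dig_commands("example.wikimedia.org"):
        print(" ".join(cmd))
```

Run as a script, it only prints the commands it would issue; calling `check_record` performs the lookups, and identical answers from all three servers would suggest the update has propagated.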
[09:39:44] great
[09:40:01] thanks again for the help
[09:41:12] anytime
[09:44:07] Data-Engineering (Sprint 6), Data-Platform-SRE, Patch-For-Review: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639 (brouberol) DNS records and reverse DNS are in place: ` brouberol@dns1004:~$ for i in 0 1 2; do ns=ns${i}.wikimedia.org; echo $ns; dig +short...
[10:04:51] Data-Engineering (Sprint 6), Event-Platform, Patch-For-Review: [Event Platform] mediawiki.page_content_change.v1 topic should be partitioned. - https://phabricator.wikimedia.org/T345806 (CodeReviewBot) gmodena opened https://gitlab.wikimedia.org/repos/data-engineering/eventutilities-python/-/merge_re...
[10:09:03] Data-Engineering (Sprint 6), Event-Platform, Patch-For-Review: [Event Platform] mediawiki.page_content_change.v1 topic should be partitioned. - https://phabricator.wikimedia.org/T345806 (CodeReviewBot) gmodena opened https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/mer...
[10:39:33] elukey: we've realized that the dse k8s workers didn't have LVS setup in their profile, meaning that any service virtual IP would never be mounted in the loopback. https://gerrit.wikimedia.org/r/c/operations/puppet/+/980347 fixes that (as we need it to setup an LVS service for our ingress gateway). However, this change will also mount the
[10:39:33] `inference` service VIP on the dse worker nodes. I wanted to check whether I should also delete the inference pool from the dse config, as I see it in the ml_k8s worker config as well.
[10:50:00] joal: I'm ready when you are. Batcave?
[10:51:32] brouberol: sorry just seen the ping!
[10:51:34] checking
[10:51:46] np, I pinged you about 45s ago
[10:51:52] +\-
[10:52:27] just to know - are you following https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service ?
[10:52:34] yes we are
[10:53:30] it's just that we realized that our k8s workers didn't include the ::profile::lvs::realserver profile, and adding it would also mount the inference service VIP into our cluster
[10:53:41] and I _think_ that service is now running in k8s-ml
[10:54:36] yes yes, in theory it shouldn't be mounted though
[10:54:55] it is not right now, correct
[10:55:25] no ok I mean that IIRC we set the lvs loopback interfaces for a backend host in hiera
[10:55:31] but it has been a while, need to check
[10:55:39] where did you see inference? Pcc?
[10:55:45] or was it a generic question?
[10:55:51] but should we merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/980347, puppet would mount it: https://puppet-compiler.wmflabs.org/output/980347/242/dse-k8s-worker1001.eqiad.wmnet/fulldiff.html
[10:56:14] I saw it in PCC, as it's still in our config: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/hieradata/role/common/dse_k8s/worker.yaml#20
[10:56:41] brouberol: ah yes, see dse's worker.yaml
[10:56:42] profile::lvs::realserver::pools: inference: {}
[10:57:00] exactly
[10:57:50] It looks like I added it in this commit, which strikes me as a blunder. https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/f86b23fc8697349537cfdd6e0f455c30b5c7e412%5E%21/hieradata/role/common/dse_k8s/worker.yaml
[10:57:55] in the wikitech tutorial above there is a section called "Add the IPs on the backend servers"
[10:58:14] before being able to do so, you'll need to have an entry in service::catalog
[10:58:25] that is where "inference" is defined
[10:58:45] once you have the k8s-ingress-dse one, you'll be able to add the right lvs loopback config
[10:59:07] so I'd say adding realserver in puppet is too early in the process
[10:59:12] Sorry, I wasn't clear. We're indeed adding a service to the service catalog, we have CRs open for that
[10:59:30] my question is really about "should we keep that inference pool in our servers?"
[10:59:34] *our cluster
[11:00:36] nono you'll need to add the k8s-ingress-dse in there
[11:00:49] I'd tend to say no, as it's now running in k8s-ml, but I wanted to check, as enabling LVS setup in our workers would mount the VIP of that service in dse, which would result in the VIP being mounted in both the dse and ml cluster
[11:01:17] right, but the inference pool was already there before I started the work on k8s-ingress-dse
[11:01:21] cf ben's comment
[11:01:53] I'd like to know whether I should clean it up, as I don't think it belongs there now
[11:01:58] I think the inference reference was due to a copy/paste from the ml-serve cluster, when we bootstrapped dse
[11:02:31] Agreed. I think it's fine to remove inference from dse-k8s - especially now that I see it was added in error, not as part of a test.
[11:02:45] brouberol: my suggestion is just to set it to the new one when you'll include the profile::realserver stuff to the workers
[11:03:05] thanks, that's what I wanted to know 👍 thanks both!
[11:08:12] folks totally different subject - is there any plan to upgrade Druid to a more recent version?
[11:08:18] 0.19 was released in 2020 :D
[11:09:38] we've worked on a bill of materials type of document https://docs.google.com/spreadsheets/d/1Obj5ozGQYl7Zei0MBLELVD8eDGqqsF_t9T3ZbrOsmZg/edit#gid=1305106294 to figure out our priorities in th
[11:09:54] that regard. I can't say when we'll get to it, as it hasn't been decided yet
[11:14:21] ah nice TIL
[11:29:56] Data-Engineering, Data-Platform-SRE, Data Products, Patch-For-Review: [blocker] Airflow unittests failing with TypeError: Pool.create_or_update_pool() - https://phabricator.wikimedia.org/T352577 (CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/m...
[11:30:16] Data-Engineering, Data-Platform-SRE, Data Products, Patch-For-Review: [blocker] Airflow unittests failing with TypeError: Pool.create_or_update_pool() - https://phabricator.wikimedia.org/T352577 (BTullis) @Antoine_Quhen has been working on the problem with the airflow builds and has merged this c...
[11:31:45] btullis: Heya - sorry, I missed your ping :S
[11:32:12] joal: No worries, it's whenever is convenient for you.
[11:33:16] btullis: I'll be teaching this afternoon, so not around :( maybe later this evening, or tomorrow
[11:34:01] Data-Engineering (Sprint 6): [Data Quality] Adopt iceberg as the data quality metrics table backend - https://phabricator.wikimedia.org/T352687 (gmodena) a:gmodena
[11:38:33] joal: OK. I'm also happy to chat async if you'd like to jot anything down in the meantime.
[11:40:14] (PuppetFailure) firing: Puppet has failed on an-coord1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:18:40] Data-Platform-SRE: Decommission kafka-jumo100[1-6] - https://phabricator.wikimedia.org/T352759 (BTullis)
[12:18:52] Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T352759 (BTullis)
[12:20:00] (PuppetFailure) resolved: Puppet has failed on an-coord1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:33:11] btullis: haven't we already decomm-ed kafka-jumbo100[1-6]?
[12:34:02] Oh yeah, thanks. I got mixed up.
[12:34:06] cf https://phabricator.wikimedia.org/T336044
[12:35:44] Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T352759 (BTullis)
[12:35:47] Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (BTullis)
[12:36:36] Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T352759 (BTullis) Accidentally created this duplicate task. Merged and closed.
[13:43:57] Data-Platform-SRE, Patch-For-Review: Create a superset container image using the PipelineLib framework - https://phabricator.wikimedia.org/T352165 (CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/superset/-/merge_requests/1 Add initial files for building superset
[13:50:54] Data-Engineering, Data-Platform-SRE, Data Products, Patch-For-Review: [blocker] Airflow unittests failing with TypeError: Pool.create_or_update_pool() - https://phabricator.wikimedia.org/T352577 (CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/m...
[14:13:58] (PS1) Btullis: Add the scap targets for the new hadoop coordinators [analytics/refinery/scap] - https://gerrit.wikimedia.org/r/980396 (https://phabricator.wikimedia.org/T336045)
[14:15:21] Data-Engineering, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 10 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (Jdforrester-WMF)
[14:15:32] I've made this patch to the refinery scap targets: https://gerrit.wikimedia.org/r/c/analytics/refinery/scap/+/980396 - If approved, I will deploy refinery to both of the new an-coord100[3-4] servers this afternoon.
[14:20:51] Analytics-Radar, Data-Engineering, Metrics Platform Backlog, Event-Platform: Send batches of events from EPC app libraries (Java, Swift) - https://phabricator.wikimedia.org/T239996 (phuedx)
[14:21:08] Data-Engineering, Movement-Insights: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (mforns) I imagine this requires: 1) Modify the data collection pipeline (probably Varnishkafka and/or Gobblin + wmf_raw.webrequest schema) to collect the Sec-Purpose hea...
[14:25:01] Data-Engineering, Movement-Insights: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (JAllemandou) If we start having data about which webrequest hits are prefetch or not, we definitely would be able to investigate! I'm in favor of moving fast and passing...
[14:30:54] Data-Engineering, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (Jhancock.wm)
[14:30:59] Data-Engineering, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (Jhancock.wm) fixed ceph2002
[14:36:31] Data-Platform-SRE, Discovery-Search (Current work): Investigate performance differences between wdqs2022 and older hosts - https://phabricator.wikimedia.org/T336443 (Gehel) Loading only a few chunks can be done with loadData.sh -s and -e options (start and end).
[14:39:10] (PS1) Btullis: Update list of scap targets to match where hdfs_tools is deployed [analytics/hdfs-tools/deploy] - https://gerrit.wikimedia.org/r/980405 (https://phabricator.wikimedia.org/T336045)
[14:40:31] (CR) Btullis: "I got the list of servers where we have scap targets for hdfs_tools with:" [analytics/hdfs-tools/deploy] - https://gerrit.wikimedia.org/r/980405 (https://phabricator.wikimedia.org/T336045) (owner: Btullis)
[15:05:20] Data-Engineering, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (Papaul) @BTullis were you able to add those nodes to partman-early-command.sh ?
[15:06:31] Data-Engineering, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (BTullis) >>! In T349934#9383342, @Papaul wrote: > @BTullis were you able to add those nodes to partman-early-command.sh ? Oh sorry, I missed the ping. I'll add t...
[15:10:14] Data-Engineering, Movement-Insights: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (WDoranWMF) @JAllemandou how complex are the changes? Is it a quick patch to get in or do we need more discussion?
[15:12:13] (DiskSpace) firing: Disk space an-test-ui1001:9100:/ 3.004% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-ui1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[15:13:36] brouberol btullis can one of y'all link me the Phab ticket with the discussion about IRC changes (creating alert channel, etc)?
I can't seem to find it
[15:13:58] inflatador https://phabricator.wikimedia.org/T346438#9352826
[15:14:28] sorry, I should get this ball rolling, but I've been pretty busy with the spark server and misc k8s-related infrastructure tasks
[15:15:04] Data-Engineering, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (BTullis) >>! In T349934#9383342, @Papaul wrote: > @BTullis were you able to add those nodes to partman-early-command.sh ? Oh, I'm so sorry. I've made a mistake w...
[15:16:51] brouberol thanks and no problem, I'm running with the alerts review so will create a task for it
[15:21:12] !log I have pushed out version 0.0.25 of conda-analytics to the test cluster. No user facing changes expected.
[15:21:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:31:37] Data-Engineering (Sprint 6): [Airflow Migration] Migrate Airflow Druid Jobs to Unique Devices Iceberg tables - https://phabricator.wikimedia.org/T347879 (Antoine_Quhen) a:Antoine_Quhen
[15:32:57] Data-Platform-SRE, observability, Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438 (bking) I'm creating a subticket for the IRC suggestions. The VictorOps suggestion is mentioned in T342578 , but we also need to complete T342578 (adding contact g...
[15:36:48] (PS3) Xcollazo: Fix recursion for Maps with Structs on SanitizeTransformation [analytics/refinery/source] - https://gerrit.wikimedia.org/r/979406 (https://phabricator.wikimedia.org/T349121)
[15:38:49] Data-Platform-SRE, observability, Epic: Change data platform-related IRC channels to improve communication - https://phabricator.wikimedia.org/T352783 (bking)
[15:49:05] (CR) Xcollazo: "Addressing Joal's comments."
[analytics/refinery/source] - https://gerrit.wikimedia.org/r/979406 (https://phabricator.wikimedia.org/T349121) (owner: Xcollazo)
[15:50:04] Data-Engineering (Sprint 6), Data-Platform-SRE, Patch-For-Review: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639 (brouberol) Thanks to @Clement_Goubert, the `k8s-ingress-dse` LVS service is now deployed. All backends appear down however {F41562919} We're...
[16:24:54] Data-Platform-SRE, observability, Epic: Change data platform-related IRC channels to improve communication - https://phabricator.wikimedia.org/T352783 (Gehel) @bking : you might want to have a look at https://meta.wikimedia.org/wiki/IRC/Bots/ircservserv
[16:33:14] Data-Engineering (Sprint 6), Data-Platform-SRE, Patch-For-Review: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639 (Clement_Goubert) We ended up rolling back because alerts were persisting even when pooling as inactive. The service was put back in `service_se...
[16:37:59] Data-Engineering, Movement-Insights: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (odimitrijevic) Can the header be translated into an x-analytics value?
[16:43:59] Data-Platform-SRE, Discovery-Search, Epic: Cirrus-streaming-updater test: validate relforge indices are correctly updated - https://phabricator.wikimedia.org/T350186 (bking) @EBernhardson @dcausse based on chatter in #wikimedia-search , it seems like we're already past this point? Like, we're already...
[16:44:49] Data-Engineering, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (Jhancock.wm) I can do this @BTullis. np!
[16:46:55] Data-Engineering, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (Jhancock.wm)
[16:55:21] Data-Platform-SRE, Discovery-Search, Epic: Cirrus-streaming-updater test: validate relforge indices are correctly updated - https://phabricator.wikimedia.org/T350186 (EBernhardson) I've run this a few times, it claims the indices in relforge match the ones in production. I'm still a bit suspicious th...
[17:22:32] Data-Engineering, Movement-Insights: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (JAllemandou) >>! In T346463#9383399, @WDoranWMF wrote: > @JAllemandou how complex are the changes? Is it a quick patch to get in or do we need more discussion? I don't...
[17:27:07] (CR) Joal: [C: +2] "Thanks Xabriel :)" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/979406 (https://phabricator.wikimedia.org/T349121) (owner: Xcollazo)
[17:29:20] (CR) Xcollazo: Fix recursion for Maps with Structs on SanitizeTransformation (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/979406 (https://phabricator.wikimedia.org/T349121) (owner: Xcollazo)
[17:29:39] I haven't done an analytics deployment train yet today, but there is nothing in the etherpad either, as far as I can see.
[17:30:57] joal: xcollazo: I see that you have a refinery source patch ready to go. Would you like that to go out today, or should it wait?
[17:31:22] I'll let xcollazo answer this :)
[17:32:02] I have a meeting clash at the same time as the normal deploy window, so I'm only getting to it a little late.
[17:32:35] I want Marcel's blessing as well, let me see if he can review sooner.
[17:32:55] mforns: ^^
[17:33:17] I've only done a refinery source release once before, so I'll be a bit rusty.
[17:33:18] Arf!
I gave it a +2, it'll probably merge :S
[17:33:25] sorry for that xcollazo
[17:35:19] ah, in that case, let's do it.
[17:35:43] btullis: I'll ping you when the gate checks are done
[17:36:08] Ack
[17:37:24] (Merged) jenkins-bot: Fix recursion for Maps with Structs on SanitizeTransformation [analytics/refinery/source] - https://gerrit.wikimedia.org/r/979406 (https://phabricator.wikimedia.org/T349121) (owner: Xcollazo)
[17:38:49] btullis: ok wikibugs took care of it ^^. I've also added the line item to https://etherpad.wikimedia.org/p/analytics-weekly-train
[17:39:53] joal: I believe this one requires a version bump in puppet as well since it is Sanitation Refine?
[17:41:39] OK, so I'm following these instructions: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Deploy/Refinery-source#How_to_deploy_with_Jenkins_(and_related_steps) and creating a new release of refinery-source, right?
[17:46:50] btullis: right
[17:48:39] absolutely right btullis - xcollazo: right as well!
[17:48:50] I'm off to have dinner with family, will be back after
[17:49:12] (PS1) Btullis: Release version 0.2.27 of refinery source [analytics/refinery/source] - https://gerrit.wikimedia.org/r/980442
[17:50:57] (PS2) Btullis: Update changelog for v0.2.27 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/980442
[17:53:50] btullis: Added it to the weekly train doc, but just in case, we also need to bump this when the release is done: https://github.com/wikimedia/operations-puppet/blob/production/modules/profile/manifests/analytics/refinery/job/refine.pp#L39
[17:54:28] Ack, thanks.
[17:55:47] No deployment of refinery though? Just refinery source and then puppet?
[18:00:43] (CR) Mforns: [C: +2] Fix recursion for Maps with Structs on SanitizeTransformation (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/979406 (https://phabricator.wikimedia.org/T349121) (owner: Xcollazo)
[18:01:44] (CR) Btullis: [C: +2] Update changelog for v0.2.27 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/980442 (owner: Btullis)
[18:07:23] Data-Platform-SRE: Check home/HDFS leftovers of ryanmax - https://phabricator.wikimedia.org/T325527 (BTullis) a:BTullis
[18:12:18] Data-Engineering, Release-Engineering-Team, GitLab (CI & Job Runners): Unblock Dockerfile syntax to build images with Gitlab trusted runner - https://phabricator.wikimedia.org/T351792 (xcollazo) >>! In T351792#9365193, @thcipriani wrote: > Before July this was enforced via kokkuri now it is enforced...
[18:12:22] (Merged) jenkins-bot: Update changelog for v0.2.27 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/980442 (owner: Btullis)
[18:12:50] xcollazo: If you're able to squeeze in that release of wmfdata-python today, that would be great. I can probably then update conda-analytics tomorrow, which helps to unblock a couple of other things.
[18:14:40] Starting build #132 for job analytics-refinery-maven-release-docker
[18:18:40] Data-Engineering, Movement-Insights: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (mforns) +1 using x_analytics if possible!
[18:20:27] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Expose 3 new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650 (RKemper) Alright, I had an initial meeting with Traffic team (Brandon & Valentin). #### Traffic team meeting summary The prim...
[18:23:35] * btullis I'm afraid I've run out of time for today, the release of refinery-source is still happening, but I'm going to have to do the update tomorrow. Sorry.
[18:26:35] (CR) Xcollazo: Fix recursion for Maps with Structs on SanitizeTransformation (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/979406 (https://phabricator.wikimedia.org/T349121) (owner: Xcollazo)
[18:28:47] > If you're able to squeeze in that release of wmfdata-python today
[18:28:47] btullis: ack. will do today!
[18:32:48] Project analytics-refinery-maven-release-docker build #132: SUCCESS in 18 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/132/
[18:35:25] Data-Engineering, Movement-Insights: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (WDoranWMF) I'll just add #data-platform-sre to ask for their input - @Gehel is this something you can help with?
[19:12:13] (DiskSpace) firing: Disk space an-test-ui1001:9100:/ 3% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-ui1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[19:32:11] Data-Engineering, Data-Platform-SRE, Data Products: [blocker] Airflow unittests failing with TypeError: Pool.create_or_update_pool() - https://phabricator.wikimedia.org/T352577 (CodeReviewBot) xcollazo updated https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/552 Test...
[19:54:22] Data-Engineering, Data-Platform-SRE, Data Products: [blocker] Airflow unittests failing with TypeError: Pool.create_or_update_pool() - https://phabricator.wikimedia.org/T352577 (CodeReviewBot) xcollazo closed https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/552 Test...
[19:56:29] Data-Engineering, Data-Platform-SRE, Data Products: [blocker] Airflow unittests failing with TypeError: Pool.create_or_update_pool() - https://phabricator.wikimedia.org/T352577 (xcollazo) Open→Resolved a:Antoine_Quhen CI build succeeded on my repro at https://gitlab.wikimedia.org/repos/da...
[20:13:17] Data-Engineering, Product-Analytics, Wmfdata-Python: Release wmfdata with ca_bundle fix - https://phabricator.wikimedia.org/T352808 (xcollazo)
[20:14:10] Data-Engineering, Product-Analytics, Wmfdata-Python, Data Products (Data Products Sprint 05): Release wmfdata with ca_bundle fix - https://phabricator.wikimedia.org/T352808 (xcollazo)
[20:14:19] Data-Engineering, Product-Analytics, Wmfdata-Python, Data Products (Data Products Sprint 05): Release wmfdata with ca_bundle fix - https://phabricator.wikimedia.org/T352808 (xcollazo) Open→In progress p:Triage→High
[20:15:39] Data-Platform-SRE: ProbeDown - https://phabricator.wikimedia.org/T352807 (bking) p:Triage→Low
[20:19:59] Data-Platform-SRE, collaboration-services: ProbeDown - https://phabricator.wikimedia.org/T352810 (bking) Open→Resolved a:bking
[20:22:26] Data-Platform-SRE, collaboration-services: ProbeDown - https://phabricator.wikimedia.org/T352810 (bking) Resolved→Invalid Problem with LDF endpoint alerting (see T352807 ) . We still need to figure out how to keep this from pinging ServiceOps Collab team, but I'm closing it in favor of T352807 at...
[20:28:13] Data-Engineering, Data-Platform-SRE, Data Products: [blocker] Airflow unittests failing with TypeError: Pool.create_or_update_pool() - https://phabricator.wikimedia.org/T352577 (JAllemandou) You guys rock <3
[20:56:04] Data-Engineering, Movement-Insights, Traffic: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (WDoranWMF)
[21:38:36] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355 (bking) Reverted the last change after we some alerts for the following hosts: `1008 1009 1010 1011 2008 2014` I suspect this has something to do...
[21:43:33] Data-Engineering, Product-Analytics, Wmfdata-Python, Data Products (Data Products Sprint 05): Release wmfdata with ca_bundle fix - https://phabricator.wikimedia.org/T352808 (xcollazo) Setup test env as follows: ` conda-analytics-activate test_wmfdata_202 source conda-analytics-activate test_wmfda...
[21:43:47] Data-Engineering, Product-Analytics, Wmfdata-Python, Data Products (Data Products Sprint 05): Release wmfdata with ca_bundle fix - https://phabricator.wikimedia.org/T352808 (xcollazo) @BTullis this is done.
[21:44:55] Data-Engineering, Product-Analytics, Wmfdata-Python, Data Products (Data Products Sprint 05): Release wmfdata with ca_bundle fix - https://phabricator.wikimedia.org/T352808 (xcollazo) a:xcollazo
[21:50:40] RECOVERY - MD RAID on aqs1013 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[21:52:38] Data-Platform-SRE: ProbeDown - https://phabricator.wikimedia.org/T352807 (bking) a:bking
[22:01:17] (KafkaReplicationFactorTooLow) firing: (4) Kafka topic codfw.mediawiki.web_ui_actions replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[22:06:17] (KafkaReplicationFactorTooLow) resolved: (4) Kafka topic codfw.mediawiki.web_ui_actions replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[23:12:13] (DiskSpace) firing: Disk space an-test-ui1001:9100:/ 2.997% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-ui1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace