[06:19:26] Data-Engineering, MediaWiki-Platform-Team, User-TheDJ: Remove old origin-when-crossorigin Safari misspelling of referrer policy - https://phabricator.wikimedia.org/T338183 (Krinkle) Open→Resolved
[07:57:55] Data-Platform-SRE, decommission-hardware: decommission an-test-client1001.eqiad.wmnet - https://phabricator.wikimedia.org/T343520 (Stevemunene)
[07:58:01] Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (Stevemunene)
[08:24:20] brouberol: I've moved T336041 to "in progress" since it looks like you've started work on it
[08:24:21] T336041: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041
[08:24:34] good call
[08:24:43] thanks!
[08:28:39] Data-Platform-SRE: Reassign partitions away from kafka-jumbo100[1-6] to kafka-jumbo10[09-15] brokers - https://phabricator.wikimedia.org/T346425 (brouberol)
[08:29:04] Data-Platform-SRE, Patch-For-Review: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 (brouberol)
[08:29:06] Data-Platform-SRE: Reassign partitions away from kafka-jumbo100[1-6] to kafka-jumbo10[09-15] brokers - https://phabricator.wikimedia.org/T346425 (brouberol)
[08:29:51] Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (brouberol)
[08:29:53] Data-Platform-SRE: Reassign partitions away from kafka-jumbo100[1-6] to kafka-jumbo10[09-15] brokers - https://phabricator.wikimedia.org/T346425 (brouberol)
[08:51:40] Data-Engineering, Beta-Cluster-Infrastructure: Many kafka errors in beta/deployment-prep - https://phabricator.wikimedia.org/T346402 (BTullis) I've had a quick look at this too. It strikes me that it's affecting both kafka-main and kafka-jumbo nodes. If I ssh into `deployment-webperf21.deployment-prep.e...
[08:59:36] Data-Engineering, Beta-Cluster-Infrastructure: Many kafka errors in beta/deployment-prep - https://phabricator.wikimedia.org/T346402 (BTullis) However, it seems that this has been happening for more than two weeks, which is all of the data we have in beta-logs. {F37726861} So if there has been a change t...
[09:13:41] Data-Platform-SRE: Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (Gehel) Looks like we still need to decommission the old VMs
[09:25:44] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Allow federated queries with the NLG endpoint (data.nlg.gr) - https://phabricator.wikimedia.org/T337296 (Gehel) Open→Resolved We are not going to support federation with endpoints that require specific h...
[09:27:28] Data-Platform-SRE, serviceops, Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (Gehel) Open→Resolved a:Gehel Decision has been documented in https://wikitech.wikimedia.org/wiki/S...
[09:27:34] Data-Platform-SRE, Discovery-Search (Current work), Epic: [EPIC] Deployment of the Search Update Pipeline on Flink / k8s - https://phabricator.wikimedia.org/T340548 (Gehel)
[09:29:17] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (MoritzMuehlenhoff) >>! In T332570#9166717, @Stevemunene wrote: > Did a powercycle in order to access the terminal, however the host does not accept the root pw. > First thought was to check the partitions from...
[09:30:32] Data-Platform-SRE, Discovery-Search (Current work): Retune enwiki_content shard settings - https://phabricator.wikimedia.org/T343820 (Gehel) Open→Resolved
[09:36:49] Data-Engineering, Beta-Cluster-Infrastructure: Many kafka errors in beta/deployment-prep - https://phabricator.wikimedia.org/T346402 (BTullis) I think I have found out what is causing this: The three tools: `navtiming::statsv`, `coal::processor` and `navtiming::webperf` tools are configured always to us...
[10:10:24] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineer...
[10:39:56] Data-Engineering, Data Engineering and Event Platform Team, MediaWiki-libs-HTTP, Beta-Cluster-reproducible, and 3 others: PHP Warning: curl_multi_remove_handle(): supplied resource is not a valid cURL Multi Handle resource - https://phabricator.wikimedia.org/T288624 (Lucas_Werkmeister_WMDE) ^ A s...
[10:53:28] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (Stevemunene) Thanks @MoritzMuehlenhoff. I managed to get access to the instance via regular ssh and confirmed that the right volumes exist, which they do ` sda 8:0 0 446.6G...
[10:56:14] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1138.eqiad.wmnet with OS bullseye
[10:58:58] (PS1) Ladsgroup: Introduce MostTranscludedPages.hql [analytics/refinery] - https://gerrit.wikimedia.org/r/957899 (https://phabricator.wikimedia.org/T309738)
[11:01:18] (PS2) Ladsgroup: Introduce MostTranscludedPages.hql [analytics/refinery] - https://gerrit.wikimedia.org/r/957899 (https://phabricator.wikimedia.org/T309738)
[11:02:56] Data-Engineering, Data Engineering and Event Platform Team, Data Pipelines, Event-Platform: EventGate occasionally fails to ingest specific schemas - https://phabricator.wikimedia.org/T326002 (gmodena)
[11:06:09] Data-Engineering, Data Engineering and Event Platform Team, Data Pipelines, Event-Platform: EventGate occasionally fails to ingest specific schemas - https://phabricator.wikimedia.org/T326002 (JAllemandou) Continuing my investigation with @gmodena , we found that only the `eventgate-analytics-ext...
[11:09:52] Data-Engineering, Data Engineering and Event Platform Team, Event-Platform: [NEEDS GROOMING][SPIKE] Can we identify indicators to inform an SLO for event emission and intake? - https://phabricator.wikimedia.org/T345195 (gmodena) Related {T326002}
[11:16:59] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (Stevemunene) Following the install via IPMI with `ipmitool -I lanplus -H "an-worker1138.mgmt.eqiad.wmnet" -U root -E sol activate` Reimage seems to have been successful this time round. Waiting for the first p...
[11:30:12] Data-Engineering, Data Engineering and Event Platform Team, Data Pipelines, Event-Platform: EventGate occasionally fails to ingest specific schemas - https://phabricator.wikimedia.org/T326002 (gmodena) >>! In T326002#9169483, @JAllemandou wrote: > Continuing my investigation with @gmodena , we fo...
[11:32:32] Data-Engineering, Data Engineering and Event Platform Team, Event-Platform: [NEEDS GROOMING][SPIKE] Can we identify indicators to inform an SLO for event emission and intake? - https://phabricator.wikimedia.org/T345195 (gmodena)
[11:33:28] (CR) Peter Fischer: "@Gmodena, I followed your advice and moved the schema under development" [schemas/event/primary] - https://gerrit.wikimedia.org/r/854572 (https://phabricator.wikimedia.org/T317609) (owner: DCausse)
[11:34:07] Data-Engineering, Data Engineering and Event Platform Team, Event-Platform: [NEEDS GROOMING][SPIKE] Can we identify indicators to inform an SLO for event emission and intake? - https://phabricator.wikimedia.org/T345195 (gmodena)
[11:34:46] (CR) Peter Fischer: cirrussearch: move to development (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/951829 (https://phabricator.wikimedia.org/T325315) (owner: Peter Fischer)
[11:37:53] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1138.eqiad.wmnet with OS bullseye completed: - an-worker1138 (**PASS**) - Removed from Puppet and Pu...
[11:42:07] btullis: when you have time, would you mind having a look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/957861/ ? The plan when the brokers are active and online will be to figure out how to reassign partitions onto them, to evacuate the brokers we'd like to decom. I have a chat w/ elukey setup for next week so we can discuss how to
[11:42:07] leverage topicmappr to assist with these operations.
[11:44:22] Thanks brouberol. Will do.
[11:44:56] no rush really, as I probably won't deploy on a friday afternoon
[11:47:28] (CR) Gmodena: [C: +2] "LGTM." [schemas/event/primary] - https://gerrit.wikimedia.org/r/955357 (owner: Peter Fischer)
[11:48:04] (Merged) jenkins-bot: npm install [schemas/event/primary] - https://gerrit.wikimedia.org/r/955357 (owner: Peter Fischer)
[11:48:07] (CR) Gmodena: [C: +1] "LGTM. Feel free to +2 and merge when ready." [schemas/event/primary] - https://gerrit.wikimedia.org/r/951829 (https://phabricator.wikimedia.org/T325315) (owner: Peter Fischer)
[11:49:09] (CR) Gmodena: [C: +1] "LGTM. Feel free to +2 when ready." [schemas/event/primary] - https://gerrit.wikimedia.org/r/854572 (https://phabricator.wikimedia.org/T317609) (owner: DCausse)
[12:11:20] (CR) Joal: "Some comments inline :) It would be interesting before creating those HQL queries to define query-results schemas, in order to see how the" [analytics/refinery] - https://gerrit.wikimedia.org/r/957899 (https://phabricator.wikimedia.org/T309738) (owner: Ladsgroup)
[12:20:23] (CR) Gmodena: [C: +2] cirrussearch: move to development [schemas/event/primary] - https://gerrit.wikimedia.org/r/951829 (https://phabricator.wikimedia.org/T325315) (owner: Peter Fischer)
[12:20:43] (CR) Gmodena: [C: +2] cirrussearch: add fetch_failure schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/854572 (https://phabricator.wikimedia.org/T317609) (owner: DCausse)
[12:20:52] (Merged) jenkins-bot: cirrussearch: move to development [schemas/event/primary] - https://gerrit.wikimedia.org/r/951829 (https://phabricator.wikimedia.org/T325315) (owner: Peter Fischer)
[12:21:11] (Merged) jenkins-bot: cirrussearch: add fetch_failure schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/854572 (https://phabricator.wikimedia.org/T317609) (owner: DCausse)
[12:40:20] Data-Platform-SRE: Find/fix logstash logging for rdf-streaming-updater - https://phabricator.wikimedia.org/T345668 (Gehel)
[12:40:27] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, serviceops-radar, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (Gehel)
[12:50:16] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (MoritzMuehlenhoff) >>! In T332570#9169498, @Stevemunene wrote: > Following the install via IPMI with `ipmitool -I lanplus -H "an-worker1138.mgmt.eqiad.wmnet" -U root -E sol activate` > > Reimage seems to have...
[12:50:43] Data-Engineering, Data Engineering and Event Platform Team, Data Pipelines, Event-Platform: EventGate occasionally fails to ingest specific schemas - https://phabricator.wikimedia.org/T326002 (phuedx) >>! In T326002#9169517, @gmodena wrote: >>>! In T326002#9169483, @JAllemandou wrote: > An entry...
[12:51:19] Quarry: Create 'reports' feature - https://phabricator.wikimedia.org/T78593 (Aklapper)
[12:51:24] Quarry, Tool-tsreports: Quarry-TSreports feature parity - https://phabricator.wikimedia.org/T78549 (Aklapper)
[12:51:30] Quarry: it would be useful to run the same Quarry query conveniently in several database - https://phabricator.wikimedia.org/T95582 (Aklapper)
[12:52:40] Quarry: Provide a way to add hyperlink in Quarry results/output - https://phabricator.wikimedia.org/T74874 (Aklapper)
[12:52:43] Quarry: REPORTS-68 Implement dynamic cache duration - https://phabricator.wikimedia.org/T60826 (Aklapper)
[12:52:48] Quarry: RSS feeds - https://phabricator.wikimedia.org/T60830 (Aklapper) Open→Declined Declining as "Tsreports is no longer available. Consider using Quarry instead."
[12:52:58] Quarry: Quarry-TSreports feature parity - https://phabricator.wikimedia.org/T78549 (Aklapper) Open→Declined Declining as "Tsreports is no longer available. Consider using Quarry instead."
[12:53:05] Data-Platform-SRE, observability, Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438 (Gehel)
[12:53:19] Data-Engineering, Data-Platform-SRE, Observability-Alerting: Migrate zookeeper prometheus checks from Icinga to Alertmanager - https://phabricator.wikimedia.org/T309012 (Gehel)
[12:53:21] Data-Platform-SRE, observability, Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438 (Gehel)
[12:53:34] Data-Platform-SRE: Confirm TLS certificate monitoring is in place for Search Platform-owned domains - https://phabricator.wikimedia.org/T343761 (Gehel)
[12:53:36] Data-Platform-SRE, observability, Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438 (Gehel)
[12:53:52] Data-Platform-SRE: Ensure Data Platform SREs have a contact group in puppet/alerting - https://phabricator.wikimedia.org/T342578 (Gehel)
[12:53:54] Data-Platform-SRE, observability, Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438 (Gehel)
[12:54:08] Data-Platform-SRE, observability, Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438 (Gehel)
[12:54:13] Data-Platform-SRE, SRE-OnFire, Wikidata, Wikidata-Query-Service, and 3 others: Review alerting around Wikidata Query Service update pipeline - https://phabricator.wikimedia.org/T336574 (Gehel)
[12:57:53] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineer...
[13:04:26] Data-Engineering, Data Engineering and Event Platform Team (Sprint 2), Event-Platform, Patch-For-Review: eventutilities-python: Gitlab CI pipeline should use memory optimized runners. - https://phabricator.wikimedia.org/T346084 (CodeReviewBot) gmodena merged https://gitlab.wikimedia.org/repos/dat...
[13:44:58] Data-Engineering: Codex, Graph, and Wikistats walk into a bar graph - https://phabricator.wikimedia.org/T336544 (Aklapper) @Milimetric: Do you plan to break this into subtasks?
[13:54:37] (PS1) Btullis: Update to Superset version 2.2.1 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/957938 (https://phabricator.wikimedia.org/T335356)
[14:02:46] Data-Platform-SRE: Service implementation for wdqs202[3-5].codfw.wmnet - https://phabricator.wikimedia.org/T345475 (bking)
[14:08:43] Data-Engineering, Movement-Insights, Product-Analytics, Research-Freezer: Investigate relation of UA deprecation to increase in automated traffic and reduction in unique devices - https://phabricator.wikimedia.org/T336715 (MGerlach)
[14:15:18] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (BTullis) I tried a version of conda-analytics where the versions of conda and the conda-libmamba-sol...
[14:31:25] Data-Engineering, Data Pipelines, Data Engineering and Event Platform Team (Sprint 2), Event-Platform: EventGate occasionally fails to ingest specific schemas - https://phabricator.wikimedia.org/T326002 (lbowmaker)
[14:41:54] (PS3) Ladsgroup: Introduce MostTranscludedPages.hql [analytics/refinery] - https://gerrit.wikimedia.org/r/957899 (https://phabricator.wikimedia.org/T309738)
[14:48:11] (CR) Ladsgroup: Introduce MostTranscludedPages.hql (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/957899 (https://phabricator.wikimedia.org/T309738) (owner: Ladsgroup)
[14:49:55] (CR) Ladsgroup: "FWIW, I ran this on fawikiquote" [analytics/refinery] - https://gerrit.wikimedia.org/r/957899 (https://phabricator.wikimedia.org/T309738) (owner: Ladsgroup)
[15:19:24] Data-Engineering, Data Products: Codex, Graph, and Wikistats walk into a bar graph - https://phabricator.wikimedia.org/T336544 (Milimetric) When this gets prioritized, we can make it into a proper epic and break it down, but if anybody else wants to take any piece of it, please don't let this stop you.
[15:19:26] Data-Platform-SRE, Discovery-Search (Current work): Add Zookeeper config to 'rdf-streaming-updater' test service on DSE cluster - https://phabricator.wikimedia.org/T344614 (bking) I'm happy to report that the flink-operator is connected to Zookeeper! We can see the znodes now: ls -R /flink/flink-app-wdq...
[15:43:01] Data-Platform-SRE, Discovery-Search (Current work): Add Zookeeper config to 'rdf-streaming-updater' test service on DSE cluster - https://phabricator.wikimedia.org/T344614 (bking) Moving this ticket to done. Operations/chaos engineering testing continues in T342149 .
[15:45:32] Data-Engineering, Data-Platform-SRE, Discovery-Search (Current work), Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (bking)
[16:00:17] (PS1) Milimetric: Enable Punjabi language [analytics/wikistats2] - https://gerrit.wikimedia.org/r/957959 (https://phabricator.wikimedia.org/T344572)
[16:20:49] (PS1) Milimetric: Release 2.10.2 [analytics/wikistats2] - https://gerrit.wikimedia.org/r/957960
[16:21:06] (CR) Milimetric: [C: +2] Enable Punjabi language [analytics/wikistats2] - https://gerrit.wikimedia.org/r/957959 (https://phabricator.wikimedia.org/T344572) (owner: Milimetric)
[16:22:05] (Abandoned) Milimetric: Enable Punjabi language [analytics/wikistats2] - https://gerrit.wikimedia.org/r/955857 (https://phabricator.wikimedia.org/T344572) (owner: Amire80)
[16:22:38] (CR) Milimetric: [C: +2] Release 2.10.2 [analytics/wikistats2] - https://gerrit.wikimedia.org/r/957960 (owner: Milimetric)
[16:23:29] (Merged) jenkins-bot: Enable Punjabi language [analytics/wikistats2] - https://gerrit.wikimedia.org/r/957959 (https://phabricator.wikimedia.org/T344572) (owner: Milimetric)
[16:24:42] (Merged) jenkins-bot: Release 2.10.2 [analytics/wikistats2] - https://gerrit.wikimedia.org/r/957960 (owner: Milimetric)
[16:45:26] Data-Engineering, Data Pipelines, Data Engineering and Event Platform Team (Sprint 2), Event-Platform, Patch-For-Review: EventGate occasionally fails to ingest specific schemas - https://phabricator.wikimedia.org/T326002 (phuedx) I don't mean to tread on anyone's toes here, @lbowmaker. If you...
[17:13:20] Data-Engineering, Data-Platform-SRE, Discovery-Search (Current work), Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (bking)
[19:05:58] Data-Engineering, Movement-Insights, Product-Analytics, Research-Freezer: Investigate relation of UA deprecation to increase in automated traffic and reduction in unique devices - https://phabricator.wikimedia.org/T336715 (Mayakp.wiki) >>! In T336715#9132510, @Mayakp.wiki wrote: > > Action Ite...
[19:25:07] Data-Engineering, Movement-Insights: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (Mayakp.wiki) p:Triage→Medium
[19:25:30] Data-Engineering, Movement-Insights: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (Mayakp.wiki) - note that we would not want to change any of our existing dimensions (like `agent_type`) to indicate prefetch pageviews since this will break our reportin...
[19:33:18] Data-Engineering, Movement-Insights, Product-Analytics, Research-Freezer: Investigate relation of UA deprecation to increase in automated traffic and reduction in unique devices - https://phabricator.wikimedia.org/T336715 (Milimetric) > 1. we need to find a way to tag the prefetch proxy traffic....
[19:38:22] Data-Engineering, DBA, Data-Services, TaxonBot, and 2 others: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (BBlack) There's a followup commit that was never merged, to re-enable pybal health monitoring on all the wikireplicas: https://gerrit.wikimedia.org/r/c/operations/pu...
[20:07:15] Data-Engineering, DBA, Data-Services, TaxonBot, and 2 others: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (Marostegui) >>! In T337446#9170734, @BBlack wrote: > There's a followup commit that was never merged, to re-enable pybal health monitoring on all the wikireplicas: h...