[00:19:35] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w...
[00:39:10] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed w...
[01:16:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:18:32] (SystemdUnitFailed) resolved: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:32:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:34:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:46:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:47:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:49:12] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:52:57] (SystemdUnitFailed) resolved: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:23:03] Analytics, AQS2.0, Tech-Docs-Team, API Platform (AQS 2.0 Roadmap), and 5 others: AQS 2.0 documentation - https://phabricator.wikimedia.org/T288664 (Sfaci)
[07:00:15] Data-Platform-SRE: [DataHub] Users are redirected to the wrong screen on logout and from certain urls. - https://phabricator.wikimedia.org/T347149 (Stevemunene) Discussions on this are ongoing on datahub slack [[ https://datahubspace.slack.com/archives/CV2UXSE9L/p1696317659529249 | here ]]. However, there ar...
[08:33:27] Data-Engineering, Data Engineering and Event Platform Team, EventStreams, Event-Platform, Patch-For-Review: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (elukey) Added https://github.com/wikimedia/eventgate/pull/23 to bump eslint to ES2019 :)
[09:00:42] Data-Platform-SRE, Patch-For-Review: Package kafkakit-prometheus-metricsfetcher as a debian package - https://phabricator.wikimedia.org/T348214 (CodeReviewBot) brouberol merged https://gitlab.wikimedia.org/repos/sre/kafka-kit/-/merge_requests/3 Drop metricsfetcher from the binaries installed by the kafk...
[09:00:48] Data-Platform-SRE, Patch-For-Review: Package kafkakit-prometheus-metricsfetcher as a debian package - https://phabricator.wikimedia.org/T348214 (CodeReviewBot) brouberol merged https://gitlab.wikimedia.org/repos/sre/kafkakit-prometheus-metricsfetcher/-/merge_requests/1 Add debian packaging rules
[09:20:19] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: Document process for getting JNL files/consider automation - https://phabricator.wikimedia.org/T347605 (Gehel) a:bking
[09:35:53] Data-Platform-SRE, Patch-For-Review: Package kafkakit-prometheus-metricsfetcher as a debian package - https://phabricator.wikimedia.org/T348214 (brouberol) ` brouberol@cumin1001:~$ sudo apt-cache search kafka-kit kafka-kit - Kafka topic administration toolkit kafka-kit-prometheus-metricsfetcher - A kafka...
[09:36:10] Data-Platform-SRE, Patch-For-Review: Package kafkakit-prometheus-metricsfetcher as a debian package - https://phabricator.wikimedia.org/T348214 (brouberol) Open→Resolved
[09:43:23] Data-Platform-SRE: Install kafka-kit-prometheus-metricsfetcher on kafka brokers - https://phabricator.wikimedia.org/T348315 (brouberol)
[09:47:25] thanks for the reviews btullis <3
[09:47:39] A pleasure.
[09:47:50] brouberol: re metrics-fetcher package - you rock thanks
[09:48:23] ^ seconded
[09:48:25] ah, I was about to cc you! It's now available, and ^ should install it on brokers
[09:48:27] <2
[09:48:29] oops
[09:48:30] <3
[09:48:32] better
[09:48:54] <4
[09:49:10] Data-Platform-SRE, Patch-For-Review: Install kafka-kit-prometheus-metricsfetcher on kafka brokers - https://phabricator.wikimedia.org/T348315 (brouberol)
[09:52:47] so, just to let you know where we are w.r.t the kafka-jumbo100[1-6] decommissioning: I have evacuated *most* of the topics. We're ~down to the webrequest_* topics, which are the chonky ones
[09:53:29] Great! Are we leaving these until next week?
[09:53:36] Data-Engineering, Data-Platform-SRE, Discovery-Search (Current work), Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (dcausse) @bking thanks for taking care of this! Something I can't remember if this was done or n...
[09:54:07] what I was thinking is: once I'm done with everything but webrequest_*, we could rebalance topics based on size between brokers 1007 -> 1015, and then reassign webrequests*
[09:54:48] btullis: yes, I'm planning to babysit the first reassignments, and move 1 partition at a time, to get a feel of how the system reacts, as these are our most active topics
[09:55:17] and we're on a friday, so I'd rather everyone spends an un-event-ful (pun intended) weekend
[09:56:10] Sounds good to me. What is the expected benefit of rebalancing and then reassigning the last topics, rather than the other way around?
[10:00:01] Less data movement. The webrequests topics have many partitions of homogenous size, so assigning them to balanced brokers should not unbalance them
[10:00:23] Ack, thanks.
[10:44:21] Data-Platform-SRE, Patch-For-Review: Install kafka-kit-prometheus-metricsfetcher on kafka brokers - https://phabricator.wikimedia.org/T348315 (brouberol) Open→Resolved
[10:44:30] Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (brouberol) After many (many) batches, we're down to the last batch before the `webrequest*` topics: ` brouberol@kafka-jumbo1010:~/topicmappr$ topicmappr rebuild --topics '^[s-v].*$' --brokers '1007,1008,1009,10...
[10:46:25] btullis: thinking about this again, I don't t
[10:46:50] *think there's any difference between rebalancing then reassigning or doing the opposite
[10:47:07] the way we perform these operations means that they are pretty much commutative
[10:47:42] (PS1) Peter Fischer: cirrussearch/update_pipeline/fetch_error use general error_type [schemas/event/primary] - https://gerrit.wikimedia.org/r/963990
[10:47:46] (because we'd place a uniform number of webrequest_* partitions on each broker, of equal size)
[10:48:10] Ack. Well, I'm happy either way :-)
[11:16:29] Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (brouberol) Final batch before starting on the large `webrequest_(text|uploads)` movements. ` brouberol@kafka-jumbo1010:~/topicmappr$ topicmappr rebuild --topics '^(wdqs_streaming_updater_15_test|wdqs_streaming_up...
[12:19:39] (PS18) Phuedx: Add analytics/metrics_platform/{app,web}/base schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833)
[12:21:40] Data-Engineering, Data-Platform-SRE, Discovery-Search (Current work), Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (dcausse) >>! In T342149#9230691, @dcausse wrote: > @bking thanks for taking care of this! Someth...
[12:33:39] Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (brouberol) We have now reassigned every single topic except the 2 largest ones: - `webrequest_text` - `webrequest_upload` As it stands, the brokers 1009-> 1015 are unbalanced in storage, assigned partitions and...
[13:01:51] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye
[13:09:11] Data-Engineering, Data Engineering and Event Platform Team, EventStreams, Event-Platform, Patch-For-Review: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (Ottomata) > Will continue next week. Err, umm, I'm off next week. After that! :)
[13:19:59] elukey: I have a small PR for prometheus-metricsfetcher, adding gzip compression of the partition/broker metrics, to circumvent the 1MB znode max size in ZK, if you're interested https://gitlab.wikimedia.org/repos/sre/kafkakit-prometheus-metricsfetcher/-/merge_requests/2
[13:21:22] brouberol: should we propose the change to upstream and then sync in our repo?
[13:21:58] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye
[13:22:01] Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (brouberol) ` topicmappr --zk-metrics-prefix kafka/jumbo-eqiad/topicmappr rebalance --topics-exclude '^webrequest_(text|upload)$' --topics '.*' --brokers '1007,1008,1009,1010,1011,1012,1013,1014,1015' --optimize-l...
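The gzip workaround mentioned at [13:19:59] can be illustrated with a minimal Python sketch. This is not the code from the merge request (which is not shown in this log); the function names, the JSON layout of the metrics, and the use of ZooKeeper's default `jute.maxbuffer` value (0xfffff bytes, just under 1 MB) as the limit are all assumptions made for illustration.

```python
import gzip
import json

# Assumption: ZooKeeper's default jute.maxbuffer (0xfffff bytes, just under 1 MB).
ZNODE_MAX_BYTES = 0xFFFFF


def encode_metrics(metrics: dict) -> bytes:
    """Serialize metrics to compact JSON, then gzip the bytes (hypothetical helper)."""
    raw = json.dumps(metrics, separators=(",", ":")).encode("utf-8")
    return gzip.compress(raw)


def decode_metrics(payload: bytes) -> dict:
    """Reverse of encode_metrics: gunzip, then parse the JSON."""
    return json.loads(gzip.decompress(payload).decode("utf-8"))


def fits_in_znode(payload: bytes, limit: int = ZNODE_MAX_BYTES) -> bool:
    """Return True if the payload is small enough to store in a single znode."""
    return len(payload) <= limit


if __name__ == "__main__":
    # Hypothetical metrics layout: one entry per topic partition.
    metrics = {
        "partitions": [
            {"topic": "webrequest_text", "partition": i, "size": 1_000_000 + i * 37}
            for i in range(30000)
        ]
    }
    raw = json.dumps(metrics, separators=(",", ":")).encode("utf-8")
    payload = encode_metrics(metrics)
    print("raw bytes:", len(raw), "fits:", fits_in_znode(raw))
    print("gzip bytes:", len(payload), "fits:", fits_in_znode(payload))
```

Per-partition metrics are highly repetitive (the same keys and topic names recur in every entry), which is why gzip tends to shrink them well below the znode limit. The reader side has to decompress before parsing, as `decode_metrics` does here.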
[13:24:07] elukey indeed
[13:24:53] that isn't a blocker anyway for us, as I'm still able to proceed cf https://phabricator.wikimedia.org/T336044#9230977
[13:27:30] Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (brouberol) I'm actually going to omit the `--optimize-leadership` flag here, as the same command run without it leads to the same number of leaderships / broker: ` Broker distribution: degree [min/max/avg]: 6/8...
[13:47:35] done: https://github.com/tarvip/kafkakit-prometheus-metricsfetcher/pull/3
[13:50:06] super <3
[13:52:05] (mostly FYI as we're waiting for the maintainer to respond)
[13:55:17] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye completed:...
[14:00:20] Data-Engineering, Data-Platform-SRE, Discovery-Search (Current work), Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (dcausse) I think the semantic of destroying the flinkdeployment resource is to get rid of the jo...
[14:18:49] Data-Engineering, Data-Engineering-Wikistats, Data Pipelines, Data Products, and 4 others: Merge ks-Arab and ks-Deva to ks - https://phabricator.wikimedia.org/T314476 (Winston_Sung) > Wikidata: > * https://www.wikidata.org/wiki/User:Mr._Ibrahem/Language_statistics_for_items > * https://www.wikida...
[14:36:16] btullis: I should be available to pair on Superset starting next week, btw
[14:37:37] Great! As it happens, upstream have responded to the bug reports with some useful pointers as well. So we should be unstuck a bit. I haven't updated the ticket yet, but check this out.
[14:38:05] https://github.com/apache/superset/issues/25397#issuecomment-1749622944
[14:39:22] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (Jclark-ctr)
[14:39:43] https://github.com/apache/superset/issues/23483#issuecomment-1749708460
[14:40:11] So we've got something to work with, rather than a brick wall.
[14:44:12] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye completed:...
[15:30:53] ah, sqlalchemy, my old friend.
[15:35:56] Data-Platform-SRE, Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (BTullis) I believe that the chain of patches for deploying multiple spark shuffler versions is ready for review. * 963281: Support m...
[15:37:31] Data-Engineering, Data-Platform-SRE, Discovery-Search (Current work), Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (bking) @JMeybohm Is it possible for us to simulate a cluster upgrade (maybe by running [[ https:...
[15:38:14] Data-Platform-SRE, Dumps-Generation, cloud-services-team, Patch-For-Review: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (BTullis) Looks good to me too. Thanks again @jbond. I'll check back again next week, once it's...
[16:51:26] (PS2) DCausse: rdf_streaming_updater: add emitter_id to side outputs [schemas/event/secondary] - https://gerrit.wikimedia.org/r/963006 (https://phabricator.wikimedia.org/T347515)
[18:06:06] (CR) Clare Ming: Add analytics/metrics_platform/{app,web}/base schemas (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: Phuedx)
[18:14:57] (PS19) Clare Ming: Add analytics/metrics_platform/{app,web}/base schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: Phuedx)
[19:28:57] Data-Engineering-Radar, Privacy Engineering, Privacy: Privacy review for dataset publishing (Wikidata topic -> pageview data) - https://phabricator.wikimedia.org/T303304 (Addshore) Open→Resolved a:Addshore
[19:29:04] Data-Engineering-Radar, Privacy Engineering, Privacy: Privacy review for dataset publishing (Wikidata topic -> pageview data) - https://phabricator.wikimedia.org/T303304 (Addshore) I'm gonna work on actually publishing this soon :)
[19:35:00] Quarry: Add maintainers to quarry - https://phabricator.wikimedia.org/T348184 (SD0001) @rook Is there any staging/beta/QA environment as well for quarry?
[19:38:16] Quarry: Add maintainers to quarry - https://phabricator.wikimedia.org/T348184 (rook) >>! In T348184#9231920, @SD0001 wrote: > @rook Is there any staging/beta/QA environment as well for quarry? A local dev environment can be setup as described in the README under "Setting up a local dev environment" As for...
[19:48:04] Quarry: Add maintainers to quarry - https://phabricator.wikimedia.org/T348184 (rook) >>! In T348184#9231920, @SD0001 wrote: > ~~@rook Is there any staging/beta/QA environment as well for quarry?~~ Never mind, got it: https://quarry-dev.wmflabs.org/ Oh yeah, quarry-dev. It's not the best representation of qu...
[19:56:40] Quarry: Add maintainers to quarry - https://phabricator.wikimedia.org/T348184 (Audiodude) FWIW I set up the dev environment without any issue and was able to run queries against mywiki. @rook is it possible to query replicas on toolforge from the dev environment if I use an SSH tunnel and change my config?
[19:58:10] Quarry: Add maintainers to quarry - https://phabricator.wikimedia.org/T348184 (rook) >>! In T348184#9232082, @Audiodude wrote: > FWIW I set up the dev environment without any issue and was able to run queries against mywiki. > > @rook is it possible to query replicas on toolforge from the dev environment if...
[21:18:43] Data-Engineering-Radar, Privacy Engineering, Privacy: Privacy review for dataset publishing (Wikidata topic -> pageview data) - https://phabricator.wikimedia.org/T303304 (Addshore) https://addshore.com/2023/10/covid-19-wikipedia-pageview-spikes-2019-2022/
[21:18:47] Quarry: Add maintainers to quarry - https://phabricator.wikimedia.org/T348184 (SD0001) The networking limitation is overcome with an ssh tunnel. However, the code in replica.py to create the replica host URL is rather weird: ` repl_host = ( f"{self.database_name}.{self.config['REPLICA_DOM...
[21:21:02] I finally wrote up my covid 19 topic -> wikidata item -> wikipedia pageviews stuff from years back https://addshore.com/2023/10/covid-19-wikipedia-pageview-spikes-2019-2022/
[21:21:33] Don't think I'll ever have time to turn that topic expansion stuff into something that's actually repeatable easily for multiple topics
[23:27:19] addshore: Wow! That's exceptional. Thanks for sharing.