[00:15:44] Data-Engineering: Check home/HDFS leftovers of paramd - https://phabricator.wikimedia.org/T340427 (diego) Hi! The standard archival process works well. Thanks!
[00:16:42] (SystemdUnitFailed) firing: hardsync-published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:18:36] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:32] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:42] (SystemdUnitFailed) resolved: hardsync-published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:29:05] Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (brouberol) ` brouberol@kafka-jumbo1010:~/topicmappr$ kafka reassign-partitions --reassignment-json-file ./webrequest_upload-phase4.json --execute --throttle 60000000 kafka-reassign-partitions --zookeeper conf1007...
[07:16:08] Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (brouberol) ` brouberol@kafka-jumbo1010:~/topicmappr$ kafka reassign-partitions --reassignment-json-file ./webrequest_upload-phase5.json --execute --throttle 60000000 kafka-reassign-partitions --zookeeper conf1007...
[07:26:15] Data-Platform-SRE: Upgrade kafka-jumbo100[7-9] to Debian Bullseye - https://phabricator.wikimedia.org/T348495 (brouberol)
[07:26:30] Data-Platform-SRE: Upgrade kafka-jumbo100[7-9] to Debian Bullseye - https://phabricator.wikimedia.org/T348495 (brouberol)
[07:26:32] Data-Platform-SRE, Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (brouberol)
[07:27:31] Data-Platform-SRE: Upgrade kafka-jumbo100[7-9] to Debian Bullseye - https://phabricator.wikimedia.org/T348495 (brouberol)
[07:27:33] Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (brouberol)
[07:30:05] (PS3) Joal: Update referer archive job to use icerberg table [analytics/refinery] - https://gerrit.wikimedia.org/r/964573 (https://phabricator.wikimedia.org/T347693)
[07:53:54] (PS4) Joal: Update referer archive job to use icerberg table [analytics/refinery] - https://gerrit.wikimedia.org/r/964573 (https://phabricator.wikimedia.org/T347693)
[08:37:15] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, SRE, and 2 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (Antoine_Quhen)
[08:38:30] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, SRE, and 2 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (Antoine_Quhen) @Jelto done. wikitech username & email address checked. Thanks!
[09:31:22] Data-Platform-SRE, Dumps-Generation, cloud-services-team, Patch-For-Review: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (jbond) @BTullis I have merged a patch and ran puppet on the two clouddumps hosts 5 times now and...
[09:40:50] Data-Engineering, Data-Platform-SRE, Epic: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710 (BTullis)
[09:42:45] volans: Thank you! Apologies for the noisy work. We do have a `sre.hadoop.reboot-workers` cookbook, but it only operates on *every* worker in the cluster. In this instance I wanted to reboot a sizeable subset of the workers, so I used a honking great `for` loop to run the reboot-single cookbook multiple times.
[09:44:27] Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (brouberol) ` brouberol@kafka-jumbo1010:~/topicmappr$ kafka reassign-partitions --reassignment-json-file ./webrequest_upload-phase6.json --execute --throttle 60000000 kafka-reassign-partitions --zookeeper conf1007...
[09:44:48] btullis: ack, if we move that to the batch classes it would automatically allow you to pass any cumin query to it to work on any subset of hosts.
[09:45:51] The problem with the for loop is that it is fairly blind to what happens with the others, I hope it was at least checking the exit code of the reboot-single runs ;)
[09:53:22] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work), Patch-For-Review: Migrate the WDQS streaming updater from FlinkKafkaConsumer/Producer to KafkaSource/Sink - https://phabricator.wikimedia.org/T326914 (dcausse) all jobs have been restarted to use newer kafka...
[09:54:50] hello! Just wondering - for things like pageviews, when does the data update to cassandra run?
[09:57:55] High hnowlan - jobs loading run after the full calendar day has passed, so between half past midnight (more or less) and 2:30am (the pageview loading job takes about 2h)
[09:58:07] Hi, not high - sorry :S
[09:59:11] joal: thanks! we're trying to dig into why there are these pageview errors every morning https://logstash.wikimedia.org/app/dashboards#/view/mediawiki-errors?_g=h@bbef750&_a=h@e50185a that timing doesn't line up though so I don't think it's import-related
[09:59:33] hnowlan: reading
[10:00:07] (the error itself is very baffling)
[10:00:34] I can't reopen the URL correctly - could you export it with the "share" functionality please?
[10:02:37] joal: https://logstash.wikimedia.org/goto/e21e69d44d71ba4d4434c8c7724a777a
[10:02:58] thanks claime
[10:03:59] erk, my bad
[10:05:30] (CR) Mforns: [C: +2] Add analytics/metrics_platform/{app,web}/base schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: Phuedx)
[10:05:53] (CR) Mforns: [C: +1] Add analytics/metrics_platform/{app,web}/base schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: Phuedx)
[10:06:31] Data-Platform-SRE, Dumps-Generation, cloud-services-team, Patch-For-Review: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (BTullis) Great! Many thanks indeed @jbond - I'll monitor for a day or so, as you suggest.
[10:06:54] This is weird claime and hnowlan - the errors are not related in time to anything special we'd do on AQS
[10:07:19] I was worried you'd say that! :)
[10:07:24] :)
[10:10:01] And there is nothing special either in terms of traffic on our end
[10:11:35] cool, thank you for checking! tbh that error doesn't look like anything that'd come from your end anyway
[10:11:44] The spikes we receive that are not cached happen usually between 2am and 4am, just after the new daily data get loaded
[10:12:04] So no flag on our side that matches the timing
[10:12:10] Sorry hnowlan
[10:12:16] Good luck with the research :S
[10:12:30] ty! :)
[10:45:06] Thanks for checking joal <3
[10:45:31] You're welcome claime :)
[11:32:01] Data-Engineering: Check home/HDFS leftovers of ntsako - https://phabricator.wikimedia.org/T343189 (WDoranWMF) thanks @BTullis apologies this got lost in email until @Sfaci pointed it out to me. I'll review this week - it looks like the dirs are just polluted with instances of the code he was working on. BUT...
[12:00:07] Quarry: Add maintainers to quarry - https://phabricator.wikimedia.org/T348184 (rook) Resolved→In progress
[12:02:40] Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (brouberol) ` brouberol@kafka-jumbo1010:~/topicmappr$ kafka reassign-partitions --reassignment-json-file ./webrequest_upload-phase7.json --execute --throttle 80000000 kafka-reassign-partitions --zookeeper conf1007...
[12:09:08] Quarry, Patch-For-Review: investigate quarry on k8s - https://phabricator.wikimedia.org/T301469 (rook) >>! In T301469#9237091, @Audiodude wrote: > I'm completely new to Kubernetes but have been reading through https://wikitech.wikimedia.org/wiki/Kubernetes/Kubernetes_Workshop. Does WM Cloud provide k8s c...
[13:05:47] Quarry: Add maintainers to quarry - https://phabricator.wikimedia.org/T348184 (SD0001) Thanks for the details. I had figured out the manual deployment process but had been confused about the role of Puppet – we don't use Puppet at all for this project? >>! In T348184#9236836, @rook wrote: > The primary thin...
[13:53:57] Data-Engineering, Data-Platform-SRE: Upgrade Presto to version 0.283 - https://phabricator.wikimedia.org/T342343 (BTullis) Beginning work on the new build now, referring to the build for 0.281 T337335 and the [[https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/presto/+/refs/heads/debian/debi...
[13:55:38] (CR) Xcollazo: [WIP] Add siteinfo information to output XML (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/963836 (owner: Milimetric)
[13:55:58] Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (brouberol) ` kafka-reassign-partitions --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --reassignment-json-file ./webrequest_upload-phase8.json --execute --throttle 80...
[13:56:05] Data-Engineering, Data-Platform-SRE: Upgrade Presto to version 0.283 - https://phabricator.wikimedia.org/T342343 (BTullis) I pushed both the `master` and `debian` branches to gerrit. Last time I forgot to push the master branch, which resulted in a build failure. ` (base) btullis@marlin:~/wmf/debs/prest...
[13:59:13] (CR) Milimetric: [C: +2] [WIP] Add siteinfo information to output XML (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/963836 (owner: Milimetric)
[13:59:27] (CR) Milimetric: [C: -2] "oops, just meant to send that comment not +2" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/963836 (owner: Milimetric)
[14:01:10] (CR) CI reject: [V: -1] [WIP] Add siteinfo information to output XML [analytics/refinery/source] - https://gerrit.wikimedia.org/r/963836 (owner: Milimetric)
[14:01:35] omg, very stressful stopping a submit job that I accidentally kicked off :P
[14:02:12] Data-Engineering, Data-Platform-SRE: Upgrade Presto to version 0.283 - https://phabricator.wikimedia.org/T342343 (BTullis) On `build2001` - check out the repository. ` git clone "https://gerrit.wikimedia.org/r/operations/debs/presto" && (cd "presto" && mkdir -p `git rev-parse --git-dir`/hooks/ && curl -L...
[14:07:35] Data-Engineering, Data Engineering and Event Platform Team, Epic, Event-Platform: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 (BTullis)
[14:08:16] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team (Sprint 3), Event-Platform, Patch-For-Review: Increase Max Message Size in Kafka Jumbo - https://phabricator.wikimedia.org/T344688 (BTullis) Open→Resolved I'm going to resolve this, since the kafka side is d...
[14:12:20] (CR) Xcollazo: [C: +2] "Re code: LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/963835 (owner: Milimetric)
[14:21:53] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (Jclark-ctr) @BTullis did you have any update on Partitioning/Raid section?
[14:27:24] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (BTullis)
[14:30:30] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (BTullis) >>! In T342454#9239028, @Jclark-ctr wrote: > @BTullis did you have any update on Partitioning/Raid section? Hi @Jclark-ctr - Apologie...
[14:59:11] brouberol: I know it's merged already, but I added a question on https://gerrit.wikimedia.org/r/c/operations/puppet/+/963964 for my own education...
[15:02:41] (answered)
[15:30:00] Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (brouberol) We have migrated all topics but `webrequest_text` which amounts to 20.5TB of data (replication factor included), so about 35% of the cluster size. This will take a bit less than 2 weeks to migrate (in...
[15:31:46] Data-Engineering, Data-Platform-SRE, Epic: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710 (BTullis)
[15:34:01] Data-Engineering, Data Engineering and Event Platform Team, Wikidata, Wikidata-Query-Service, and 2 others: Five deleted Wikidata items pertaining to Wikimedia category pages still present in the Query Service - https://phabricator.wikimedia.org/T342593 (dcausse) Reconciled these items manually,...
[15:34:05] Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (brouberol) ` brouberol@kafka-jumbo1010:~/topicmappr$ topicmappr rebuild --topics webrequest_text --brokers 1015,1014,1013,1012,1011,1010,1009,1008,1007 --chunk-step-size 1 --force-rebuild | grep -v no-op Topics:...
[15:34:28] Data-Engineering, Data-Platform-SRE, Epic: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710 (BTullis)
[15:37:40] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Event-Platform: Partition reassignment on kafka-jumbo negatively impacting mw-page-content-change-enrich - https://phabricator.wikimedia.org/T347676 (Gehel) Open→Resolved
[15:42:13] Analytics-Radar, Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, and 3 others: Configuration Management for Kafka settings - https://phabricator.wikimedia.org/T276088 (Gehel)
[15:49:31] (PS1) Joal: Add unique-devices Iceberg schemas and scripts [analytics/refinery] - https://gerrit.wikimedia.org/r/964939 (https://phabricator.wikimedia.org/T347693)
[15:52:05] (CR) Ebernhardson: [C: +1] rdf_streaming_updater: add emitter_id to side outputs [schemas/event/secondary] - https://gerrit.wikimedia.org/r/963006 (https://phabricator.wikimedia.org/T347515) (owner: DCausse)
[16:06:41] Data-Engineering, Data-Platform-SRE, DBA: Re-clone dbstore1007:s2 following a crash - https://phabricator.wikimedia.org/T343109 (BTullis) @Ladsgroup - Would you mind if I check with you the steps required here please? I notice that you said we can't use the `sre.mysql.clone` cookbook, so I'll need to...
[16:09:09] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, SRE, and 2 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (Ahoelzl) Approved. Thanks.
[16:14:20] Data-Engineering, Data-Platform-SRE: Upgrade Presto to version 0.283 - https://phabricator.wikimedia.org/T342343 (BTullis) Copied the debs to apt1001. ` btullis@apt1001:~$ rsync rsync://build2001.codfw.wmnet/pbuilder-result/bullseye-amd64/presto* . ` Added the debs to the apt-repository with: ` btullis@a...
[16:30:34] (CR) Ebernhardson: [C: +2] rdf_streaming_updater: add emitter_id to side outputs [schemas/event/secondary] - https://gerrit.wikimedia.org/r/963006 (https://phabricator.wikimedia.org/T347515) (owner: DCausse)
[16:31:14] (Merged) jenkins-bot: rdf_streaming_updater: add emitter_id to side outputs [schemas/event/secondary] - https://gerrit.wikimedia.org/r/963006 (https://phabricator.wikimedia.org/T347515) (owner: DCausse)
[16:44:29] (CR) Xcollazo: [C: +1] "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/964573 (https://phabricator.wikimedia.org/T347693) (owner: Joal)
[16:56:11] (CR) Xcollazo: [WIP] Add siteinfo information to output XML (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/963836 (owner: Milimetric)
[17:09:59] (CR) Milimetric: [WIP] Add siteinfo information to output XML (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/963836 (owner: Milimetric)
[17:13:48] (CR) Xcollazo: Add unique-devices Iceberg schemas and scripts (5 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/964939 (https://phabricator.wikimedia.org/T347693) (owner: Joal)
[17:20:40] (CR) Milimetric: [V: +2] "verified this script works with the new table schema, but waiting to merge until deployment" [analytics/refinery] - https://gerrit.wikimedia.org/r/963835 (owner: Milimetric)
[18:58:13] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (Jclark-ctr) @Papaul I see cloudelasticservers in site.pp it was added by Bking previously node /^cloudelastic1...
[19:16:22] (CR) Joal: "Thank you @xcollazo for the quick turnaround in review!" [analytics/refinery] - https://gerrit.wikimedia.org/r/964573 (https://phabricator.wikimedia.org/T347693) (owner: Joal)
[19:19:34] (PS2) Joal: Add unique-devices Iceberg schemas and scripts [analytics/refinery] - https://gerrit.wikimedia.org/r/964939 (https://phabricator.wikimedia.org/T347689)
[19:25:22] (CR) Joal: "One thread left to discuss :) Thanks again for the speed of the reviews!" [analytics/refinery] - https://gerrit.wikimedia.org/r/964939 (https://phabricator.wikimedia.org/T347689) (owner: Joal)
[19:25:56] (PS3) Joal: Add unique-devices Iceberg schemas and scripts [analytics/refinery] - https://gerrit.wikimedia.org/r/964939 (https://phabricator.wikimedia.org/T347689)
[19:56:01] Data-Engineering, Data-Platform-SRE, DBA: Re-clone dbstore1007:s2 following a crash - https://phabricator.wikimedia.org/T343109 (Ladsgroup) Generally the steps are correct but don't clone from db1162. That's eqiad's master and serves user traffic, bringing it down will make basically 11 large wikis i...
[20:00:32] (CR) Xcollazo: "I just noticed we don't have the Iceberg CREATE statements as part of this CR?" [analytics/refinery] - https://gerrit.wikimedia.org/r/964939 (https://phabricator.wikimedia.org/T347689) (owner: Joal)
[20:04:11] (PS2) Milimetric: Expand mediawiki_project_namespace_map table [analytics/refinery] - https://gerrit.wikimedia.org/r/963835 (https://phabricator.wikimedia.org/T348578)
[21:55:27] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (Papaul) @Jclark-ctr ok then the only thing left is to change it in netbox to use the public VLAN