[00:33:39] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [01:29:42] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:31:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:33:33] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [05:29:57] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:17:53] 10Data-Platform-SRE, 10Patch-For-Review: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) ` brouberol@kafka-jumbo1010:~/topicmappr$ kafka reassign-partitions --reassignment-json-file ./webrequest_text-phase2.json --execute --throttle 60000000 kafka-reassign-partitions... 
[06:34:22] (HdfsCorruptBlocks) resolved: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [06:48:33] Hi team - I have a power cut this morning, I'll be working offline. [06:48:47] Back in ~3h [07:09:51] (HdfsFSImageAge) firing: The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [07:20:29] PROBLEM - SSH on an-master1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:21:55] RECOVERY - SSH on an-master1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:26:15] PROBLEM - SSH on an-master1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:27:43] RECOVERY - SSH on an-master1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:37:32] 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10Stevemunene) [07:37:47] PROBLEM - SSH on an-master1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:39:51] (HdfsFSImageAge) firing: The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. 
- https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [07:42:05] RECOVERY - SSH on an-master1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:54:59] PROBLEM - SSH on an-master1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:57:51] RECOVERY - SSH on an-master1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:09:21] PROBLEM - SSH on an-master1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:09:35] I'm eyeing https://phabricator.wikimedia.org/T329398 as something to work on as kafka reassignments are ongoing. Does anyone know how we usually monitor x509 certificate expiration, and make sure certificates are puppetized? [08:13:21] brouberol: I can talk to you about this one in our sync if you like. This skein certificate is a bit of an oddity because we don't manage it at present, it's just auto-created with a 1 year expiry. [08:14:04] 👍 so at least having an alert on its expiration date would be useful [08:14:28] 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10Stevemunene) The new druid workers have fully joined the cluster and we are ready to move on with the next steps. {F38214901} From the conversations on the last druid refresh, the servers w... [08:15:09] Yes. I'd start with the one on an-test-client1002, which is associated with our analytics-test airflow instance. [08:23:33] The first question I would ask myself is: will I have to use Icinga to check (alert on) the expiration date, or is there a smarter way of doing it with Prometheus these days?
[08:28:37] I'm guessing we could rely on exporters such as https://github.com/amimof/node-cert-exporter. I'm not sure whether we could build them to run on Buster though [08:31:29] There are some expiry based alerts in prometheus here: https://codesearch.wmcloud.org/search/?q=expiry&files=&excludeFiles=&repos=operations%2Falerts - I would look to see how those values get into prometheus. [08:32:07] Maybe some are based on probes, or maybe some are based on the textfile collector of the node exporter. [08:32:35] You could also ask in #wikimedia-observability or tag some people from that team and ask in the ticket. [08:34:34] This looks like a probe based check for cassandra TLS expiry: https://gerrit.wikimedia.org/g/operations/puppet/+/a72cec21a8c47d54605de0bcaa50786e0972fc55/modules/cassandra/manifests/instance/monitoring.pp#78 [08:34:39] RECOVERY - SSH on an-master1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:35:17] ...but I don't think that will work in the case of skein, because there isn't a TCP port using this certificate (I don't think). [08:42:55] PROBLEM - SSH on an-master1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:51:27] RECOVERY - SSH on an-master1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:57:13] PROBLEM - SSH on an-master1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:57:23] 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10BTullis) >>! In T336042#9245252, @Stevemunene wrote: > The new druid workers have fully joined the cluster and we are ready to move on with the next steps. Great! > From the conversations...
[08:57:36] ^ I'm going to look at this behaviour of an-master1002 [08:58:12] I need to reboot an-master1002 anyway for T344671 [08:58:39] RECOVERY - SSH on an-master1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:05:53] PROBLEM - SSH on an-master1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:11:41] RECOVERY - SSH on an-master1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:13:08] Well, it's not happy. I've tried logging in as root over IPMI but I've not quite got a bash prompt yet. [09:13:11] https://usercontent.irccloud-cdn.com/file/Sqn8Xjy5/image.png [09:13:15] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=an-master1002&var-datasource=thanos&var-cluster=analytics [09:14:33] https://kafka.apache.org/blog#apache_kafka_360_release_announcement - KRaft ready for production! [09:15:59] PROBLEM - SSH on an-master1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:16:00] elukey: Ooh, interesting. [09:18:17] there is a nice guide to jump from any version to 3.6 [09:18:19] !log power cycling an-master1002 to address unresponsiveness [09:18:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:18:23] that is reassuring for us [09:18:55] Yes. 
Now we just have to make time to prioritise it :-) [09:20:19] RECOVERY - SSH on an-master1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:31:55] !log rebooting an-coord1002 for T344671 [09:31:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:34:42] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:34:51] (HdfsFSImageAge) resolved: The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [10:09:04] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10EventStreams, 10Event-Platform, 10Patch-For-Review: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (10elukey) Tried to deploy es-internal in staging, and got: ` {"name":"eventstreams","hostname":... [10:15:32] 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10Stevemunene) This does make sense, Thanks @BTullis.
I shall be doing a string of patches for this [10:34:43] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:49:42] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:01:47] 10Data-Platform-SRE, 10Data-Catalog: Fix the build process for DataHub - https://phabricator.wikimedia.org/T348738 (10BTullis) [11:02:28] 10Data-Platform-SRE, 10Data-Catalog: Fix the build process for DataHub - https://phabricator.wikimedia.org/T348738 (10BTullis) p:05Triage→03High [11:04:17] 10Data-Platform-SRE, 10Data-Catalog: Fix the build process for DataHub - https://phabricator.wikimedia.org/T348738 (10CodeReviewBot) btullis updated https://gitlab.wikimedia.org/repos/data-engineering/datahub/-/merge_requests/4 Configure gradle proxies for trusted runners only [11:24:31] \o/ back online :) [11:24:49] Ah, we've missed you :-) [11:25:58] btullis: tell me, what have I missed :) [11:27:47] joal: an-master1002 jammed up for unknown reason, rebooted now and back to normal. Unusual though. [11:28:10] btullis: was it active Namenode at the time? [11:28:47] No, standby role at the time. No known impact on HDFS. [11:29:30] weird :S [11:29:57] Could it be related to FSimage? a moment when it was either creating the image, or copying it? [11:30:02] anyhow - thanks for fixing! [11:30:36] 10Data-Platform-SRE, 10Data-Catalog: Fix the build process for DataHub - https://phabricator.wikimedia.org/T348738 (10BTullis) I think that I have fixed this now. 
I used the following syntax for the command options, which I believe works in dash: ` - -Dhttp.proxyHost=${HTTP_PROXY:+webproxy} - -Dhttps.proxyHos... [11:31:51] A pleasure. I haven't investigated the cause further yet. [11:36:59] joal: I have added you as a reviewer to the chain of patches here, since they fiddle with `yarn-site.xml` and node manager config. https://gerrit.wikimedia.org/r/c/operations/puppet/+/963281 [11:37:16] ack btullis - checking [11:37:26] Should be ready to roll out multiple spark shufflers in parallel to the test cluster early next week. [12:00:23] !log pushing out presto version 0.283 to the test cluster. [12:00:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:01:29] 10Data-Engineering, 10Data-Platform-SRE: Upgrade Presto to version 0.283 - https://phabricator.wikimedia.org/T342343 (10BTullis) I'm pushing out the new version of presto to the test cluster with: ` btullis@cumin1001:~$ sudo debdeploy deploy -u 2023-10-12-presto.yaml -Q 'P{O:analytics_test_cluster::presto::ser... [12:04:43] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:25:51] I have a set of 3 xs MRs as prep-work for kafka-jumbo100[1-6] retirement, for anyone interested: https://gerrit.wikimedia.org/r/c/operations/puppet/+/965159 https://gerrit.wikimedia.org/r/c/operations/puppet/+/965160 https://gerrit.wikimedia.org/r/c/operations/puppet/+/965161 [12:28:27] tx for the quick review btullis! Do we have to do anything to get https://gerrit.wikimedia.org/r/c/operations/puppet/+/965160 to apply? eg restart a service manually? Or does puppet handle everything? [12:36:32] joal: I've added you as reviewer on https://gerrit.wikimedia.org/r/c/analytics/refinery/+/965166. 
Feel free to 301 if you think you're not the right person. Thank you! [12:38:43] brouberol: It looks like puppet will restart karapace on any config file change: https://github.com/wikimedia/operations-puppet/blob/a766af194acb07336e95f47c964f65a2634695c7/modules/karapace/manifests/init.pp#L52 [12:38:52] ...but there's no harm in checking. [12:39:43] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:40:45] oh, sorry. This one I worked out on my own, but I was more worried about the changes to hieradata/role/common/analytics_cluster/launcher.yaml in https://gerrit.wikimedia.org/r/c/operations/puppet/+/965160 [12:42:55] grepping for `datahub_kafka_jumbo` yields no particular result, so that might just be a no-op, but I'm not 100% sure TBH [12:44:43] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:00:17] brouberol: Err, yes this one is a little more complicated. I *think* that the only service impacted will be restarted automatically: https://github.com/wikimedia/operations-puppet/blob/a766af194acb07336e95f47c964f65a2634695c7/modules/airflow/manifests/instance.pp#L423-L434 [13:01:03] ..but you may want to `systemctl status airflow-scheduler@analytics.service` to make sure. [13:01:33] thanks, on it [13:02:01] that'll be on an-launc [13:02:09] Yup [13:02:13] *an-launcher, or every an-* hosts? 
[13:02:57] Just an-launcher1002, because that's the host where our analytics airflow instance is running: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Airflow/Instances#analytics [13:03:04] on an-launcher: systemctl status airflow-scheduler@analytics.service -> Active: active (running) since Thu 2023-10-12 12:45:10 UTC; 17min ago [13:03:11] seems like you were right! [13:04:35] Great! You can also double-check what puppet logged and did by looking at the last report: https://puppetboard.wikimedia.org/report/an-launcher1002.eqiad.wmnet/461f259564e58631552ed6b85011749591c064cf [13:10:46] oh nice! TIL, thanks [13:18:36] 10Data-Engineering, 10Data-Platform-SRE: Upgrade Presto to version 0.283 - https://phabricator.wikimedia.org/T342343 (10BTullis) I've done some basic tests on the test cluster and things appear to be working properly. ` presto> SELECT node_id,node_version FROM system.runtime.nodes; node_id... [13:19:38] 10Data-Platform-SRE: Reboot apifeatureusage* hosts - https://phabricator.wikimedia.org/T348418 (10bking) 05Open→03Resolved a:03bking [13:20:25] Heads-up, I'm going to reboot archiva1002 in a few minutes. Hopefully it won't affect any deployments. [13:22:58] !log rebooting archiva1002.wikimedia.org for T344671 [13:23:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:24:44] 10Data-Platform-SRE, 10Data-Catalog: Fix the build process for DataHub - https://phabricator.wikimedia.org/T348738 (10CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/datahub/-/merge_requests/4 Configure gradle proxies for trusted runners only [13:28:56] 10Quarry: Remove gerrit git from quarry puppet - https://phabricator.wikimedia.org/T348748 (10rook) [13:29:05] btullis we're in the pairing session if you wanna join [13:30:35] (03CR) 10Joal: [C: 03+1] "LGTM! 
Let us know when you'd like to see this deployed @brouberol" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/965166 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [13:31:42] btullis: I've +1ed your patches for the spark shuffler thing - let me know when you wish to apply to test, so that we monitor [13:31:42] (03CR) 10Brouberol: Remove kafka-jumbo100[1-6] brokers from bootstrap hosts (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/965166 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [13:31:45] btullis NM, we jumped out...hit me up if you are interested in pairing later though. I don't have too much on my plate [13:32:36] inflatador: Gah! Sorry, I missed your message. I'm still around for the next 30 minutes. [13:45:17] 10Data-Platform-SRE, 10Data-Catalog: Fix the build process for DataHub - https://phabricator.wikimedia.org/T348738 (10BTullis) This worked as hoped. We can see that the `publish` stage worked on the `main` branch, which required trusted runners, and the `build` stage worked on the unprotected feature branch. {... [13:46:14] joal: I think we should merge the two no-op spark shuffler ones today, then we will be ready for the third one to enable on the test cluster next week. [13:47:49] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/spark/-/merge_requests/1 Create a script for packa...
[13:50:48] works for me btullis - Let's merge/deploy/restart today and monitor that no-op is actually no-op :) [13:52:16] Cool, just doing a final pcc run after fixing that `if` syntax suggestion of yours: https://puppet-compiler.wmflabs.org/output/963304/44022/ [13:52:38] btullis: I'll be gone for kids for the next 2 hours, then back [13:53:07] OK, it's a date :-) [13:53:18] :) [13:59:53] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/spark/-/merge_requests/2 Set the debian/rules file... [14:00:04] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/spark/-/merge_requests/2 Set the debian/rules file... [14:56:27] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1007.... [14:56:42] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1007.wiki... [14:57:14] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1007.... 
[15:05:39] 10Data-Engineering, 10serviceops, 10Event-Platform: Traffic for eventstreams-internal seems to be zero for the past months - https://phabricator.wikimedia.org/T348763 (10elukey) [15:14:15] (03PS2) 10Milimetric: Add siteinfo information to output XML [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/963836 (https://phabricator.wikimedia.org/T348761) [15:19:41] 10Data-Platform-SRE, 10Discovery-Search (Current work): Investigate recent CirrusSearch p95 latency - https://phabricator.wikimedia.org/T347988 (10bking) [15:31:23] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1007.wiki... [15:33:16] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10Jclark-ctr) @bking @Papaul I was able to change netbox to Public Vlan redoing most of the steps for setting up... [15:55:42] Here I am :) [15:56:04] btullis: happy to help monitoring (while in meetings) if you wish to apply the patch [15:56:29] joal: Yes, let's do it. [15:58:25] Oh I spotted something that shouldn't be there. Might not make a difference, but I will probably put it back. [15:58:39] https://gerrit.wikimedia.org/r/c/operations/puppet/+/963281/35/modules/profile/manifests/hadoop/worker.pp#b15 [16:12:13] joal: That first one is now merged. Should be a noop. [16:13:53] second patch is now merged. Should also be a noop. 
[16:14:56] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1009.... [16:15:04] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1010.... [16:15:12] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1008.... [16:19:32] 10Data-Engineering, 10MediaWiki-extensions-EventLogging: Hard-deprecate mw.eventLog.inSample - https://phabricator.wikimedia.org/T348776 (10phuedx) [16:19:52] 10Data-Engineering, 10MediaWiki-extensions-EventLogging, 10MediaWiki-extensions-WikimediaEvents: Add mw.eventLog.pageviewInSample() method - https://phabricator.wikimedia.org/T348777 (10phuedx) [16:21:40] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10BTullis) I've deployed the first two patches: * 963281: Support multiple spark yarn shufflers in parallel | https://gerrit.wikimedia... [16:40:05] 10Data-Platform-SRE, 10Patch-For-Review: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) ` brouberol@kafka-jumbo1010:~/topicmappr$ kafka reassign-partitions --reassignment-json-file ./webrequest_text-phase3.json --execute --throttle 60000000 kafka-reassign-partitions... 
[16:44:43] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:46:35] 10Data-Engineering, 10MediaWiki-extensions-EventLogging, 10MediaWiki-extensions-WikimediaEvents: Add mw.eventLog.pageviewInSample() - https://phabricator.wikimedia.org/T348777 (10phuedx) [16:46:43] 10Data-Engineering, 10MediaWiki-extensions-EventLogging: Hard-deprecate mw.eventLog.inSample() - https://phabricator.wikimedia.org/T348776 (10phuedx) [17:04:50] 10Data-Platform-SRE, 10Data-Catalog: Fix the build process for DataHub - https://phabricator.wikimedia.org/T348738 (10BTullis) 05Open→03Resolved [17:12:24] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10BTullis) I have added the yarn shuffler jars to the apt repository. ` btullis@apt1001:~/spark$ sudo -i reprepro include bullseye-wiki... [17:17:39] 10Quarry: Quarry suggests invalid database names, and doesn't suggest some valid database names - https://phabricator.wikimedia.org/T289943 (10github-toolforge-bot) siddharthvp closed https://github.com/toolforge/quarry/pull/24 [17:18:11] 10Quarry, 10cloud-services-team: Support queries against Quarry's own database and ToolsDB - https://phabricator.wikimedia.org/T151158 (10github-toolforge-bot) siddharthvp closed https://github.com/toolforge/quarry/pull/26 [17:35:07] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1009.wiki... 
[17:35:11] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1010.wiki... [17:35:19] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1008.wiki... [17:38:50] where does the code for org.wikimedia:eventutilities live? Not having any luck finding it [17:38:55] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3): Update AQS API with September net new content data - https://phabricator.wikimedia.org/T348598 (10Ahoelzl) @lbowmaker to be prioritized for Sprint 4 [17:43:11] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1010.... [17:43:18] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1009.... [17:43:38] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1008.... 
[17:59:44] (03CR) 10Jdlrobson: [C: 03+2] Refactor schema structure [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/965258 (https://phabricator.wikimedia.org/T346106) (owner: 10Kimberly Sarabia) [18:00:18] (03Merged) 10jenkins-bot: Refactor schema structure [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/965258 (https://phabricator.wikimedia.org/T346106) (owner: 10Kimberly Sarabia) [18:01:50] ebernhardson: could it be here? https://github.com/nomoa/wikimedia-event-utilities [18:02:30] I found it from the readme here: https://gitlab.wikimedia.org/repos/data-engineering/eventutilities-python [18:03:32] btullis: ahha, thanks! Thats not the right place, but it's a fork of the right one and it contains the SCM information in the pom.xml pointing at the right place. Thanks! [18:03:51] 10Data-Platform-SRE: Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0 - https://phabricator.wikimedia.org/T338057 (10xcollazo) [18:07:06] 10Data-Platform-SRE: Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0 - https://phabricator.wikimedia.org/T338057 (10xcollazo) [18:14:25] ebernhardson: Great! You're welcome. [18:54:11] 10Quarry: git-crypt for config.yaml files - https://phabricator.wikimedia.org/T348476 (10github-toolforge-bot) vivian-rook closed https://github.com/toolforge/quarry/pull/27 [18:54:36] 10Quarry: git-crypt for config.yaml files - https://phabricator.wikimedia.org/T348476 (10rook) 05Open→03Resolved a:03rook [18:56:05] 10Quarry: update readme with notes on server setup - https://phabricator.wikimedia.org/T348798 (10rook) [18:59:54] 10Quarry: update readme with notes on server setup - https://phabricator.wikimedia.org/T348798 (10rook) https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Quarry#Deployment felt the more appropriate place. 
[19:00:02] 10Quarry: update readme with notes on server setup - https://phabricator.wikimedia.org/T348798 (10rook) 05Open→03Resolved [19:00:48] 10Quarry: Quarry suggests invalid database names, and doesn't suggest some valid database names - https://phabricator.wikimedia.org/T289943 (10rook) With PR-24 closed, should this task be closed as resolved? [19:01:38] 10Quarry, 10cloud-services-team: Support queries against Quarry's own database and ToolsDB - https://phabricator.wikimedia.org/T151158 (10rook) With PR-26 closed, should this task be closed as resolved? [19:03:21] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1010.wiki... [19:03:33] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1009.wiki... [19:03:41] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1008.wiki... [19:34:32] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3): Update AQS API with September net new content data - https://phabricator.wikimedia.org/T348598 (10nshahquinn-wmf) 05Open→03Resolved Yes, this is resolved! 
Thanks, everyone 😊
[19:45:36] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1009....
[19:45:49] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1010....
[19:49:43] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:04:43] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:26:08] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1010.wiki...
[20:26:13] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1009.wiki...
[20:38:55] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1008....
[21:17:20] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: WDQS graph split: load data from dumps into new hosts - https://phabricator.wikimedia.org/T347504 (bking) I started a data reload for hosts `wdqs1022-1024`. These are running in a tmux window under my user on `cumin1001`. Based on T323096, we expe...
[21:59:08] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1008.wiki...
[22:35:23] Data-Engineering, Data-Platform-SRE, Discovery-Search (Current work), Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (bking) Resolved→In progress
[22:35:31] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, [DEPRECATED] wdwb-tech, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (bking)
[22:37:09] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Event-Platform: [SPIKE] Investigate what happens to deployed Flink clusters if the k8s operator goes down? - https://phabricator.wikimedia.org/T346231 (bking) @gmodena [[ https://phabricator.wikimedia.org/T342149#92286...