[04:15:51] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [04:21:40] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (10WolfgangFahl) What a great application of Postel's law https://en.wikipedia.org/wiki/Robustness_principle [05:17:58] (SystemdUnitFailed) firing: (3) monitor_refine_eventlogging_analytics.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:52:18] 10Data-Platform-SRE, 10Research, 10WMDE-TechWish-Maintenance-2023: Publish dump scraper reports - https://phabricator.wikimedia.org/T341751 (10awight) [07:52:45] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (10dcausse) 05Resolvedβ†’03Open @Hannah_Bast thanks for making such a change! I did a quick test locally a... [07:52:50] 10Data-Platform-SRE, 10Research, 10WMDE-TechWish-Maintenance-2023: Publish dump scraper reports - https://phabricator.wikimedia.org/T341751 (10awight) The data and metadata are published--the final step is to announce on the research-l mailing list. 
[08:07:11] 10Data-Platform-SRE, 10Cloud-VPS, 10cloud-services-team, 10User-aborrero: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10taavi) Tagging #data-platform-sre to get answers to Andrew's questions in T346948#9184440. [08:16:06] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [08:33:15] Morning all. I'm going to make that patch to bump the HDFS namenode heap allocation, as we discussed a couple of days ago. [09:17:58] (SystemdUnitFailed) firing: (3) monitor_refine_eventlogging_analytics.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:21:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:22:43] (SystemdUnitFailed) firing: (3) monitor_refine_eventlogging_analytics.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:36:10] I have finished the write-up of yesterday's incident on Wikitech and set it to in-review: https://wikitech.wikimedia.org/wiki/Incidents/2023-09-27_Kafka-jumbo_mirror-makers - Please feel free to review and/or amend. [09:38:13] I'm looking for someone who'd be willing to pair on the reassignment of kafka partitions away from kafka-jumbo10[01->06].eqiad.wmnet. 
The idea would be to use topicmappr to compute a reassignment plan, and run it in several steps, to avoid overloading the servers [09:41:25] One thing I'm especially interested in knowing is whether we usually reassign as fast as possible, or do we use a throttle. And if so, what throttle value do we commonly use? [09:41:35] I'd be happy to pair, but I've never done it before so I would likely be learning from you. You're welcome to seek a more qualified volunteer. There is some prior art here, if that helps. https://phabricator.wikimedia.org/T341558 [09:43:33] I believe that we tend to err on the side of caution and exercise patience. [09:44:03] Also: https://gitlab.wikimedia.org/elukey/kafka_main_rebalance/-/tree/main/main-codfw [09:44:44] Thanks! elukey, would you be interested in joining as well? [09:45:07] (given that you were involved in https://phabricator.wikimedia.org/T341558, as btullis pointed out) [09:45:39] brouberol: two suggestions: 1) use a gitlab repo to commit the plan and the rollbacks, so in case something happens mid-way we know how to pick up 2) I used some throttling IIRC, but the idea was to start from something low and maybe increase once you see that the cluster is fine [09:47:30] gotcha! I'd be interested in knowing what we consider to be "low" here. I remember some d2 AWS instances that would cap at 40MB/s for example, so low would be 10MB/s, or 500MB/s on some GCP instances with some beefy network links [10:00:31] 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Degraded RAID on dbstore1005 - https://phabricator.wikimedia.org/T347449 (10BTullis) 05Openβ†’03Resolved Thanks @Jclark-ctr - but everything is OK now with this server so no further action is required at the moment. It looks like the RAID controller must have had a... 
[10:02:34] 10Data-Engineering, 10Data-Platform-SRE: [INCIDENT] kafka-jumbo mirrormaker from main-eqiad crashes associated with RecordTooLargeException errors - https://phabricator.wikimedia.org/T347481 (10BTullis) 05Open→03Resolved I've set the Wikitech page status to in-review and shared the link. There are few acti... [10:06:37] brouberol: we have 10G cards on all jumbo nodes so I don't expect issues, but we could cap to something like 50MB/s as a starter [10:07:08] 10Data-Engineering, 10Data-Platform-SRE, 10SRE Observability: Install a Prometheus connector for Presto, pointed at thanos-query - https://phabricator.wikimedia.org/T347430 (10JAllemandou) Adding this to our radar as well, to keep an eye when we start querying. [10:07:35] 10Data-Engineering, 10Data-Platform-SRE, 10SRE Observability, 10Data Engineering and Event Platform Team (Sprint 2): Install a Prometheus connector for Presto, pointed at thanos-query - https://phabricator.wikimedia.org/T347430 (10JAllemandou) [10:16:09] elukey: thanks [10:17:07] Is this 50 MB/s here? https://gitlab.wikimedia.org/elukey/kafka_main_rebalance/-/blob/main/main-codfw/executor.sh#L7 [10:18:32] btullis: I don't recall the unit for throttle, but kafka main has 1G interfaces, so I went probably more conservative [10:18:53] it is bytes [10:19:19] so I used 50MBps indeed [10:19:30] probably a bit too much, but it worked nicely [10:20:00] Ah, I didn't know about the 1G cards in kafka-main. Thanks elukey. [10:20:05] It looks like we might need to make some updates to this section of the docs while we're working on this ticket: https://wikitech.wikimedia.org/wiki/Kafka/Administration#Rebalance_topic_partitions_to_new_brokers [10:21:45] I'm planning to run an `sre.hadoop.roll-restart-masters` cookbook to pick up the new heap settings, unless anyone suggests otherwise. 
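As established above, the reassignment throttle is given to `kafka reassign-partitions` in bytes per second, which makes the values easy to misread at a glance. A minimal shell helper for the conversion; `mb_to_throttle` is our own illustrative name, not part of any Kafka tooling:

```shell
# Convert a throttle expressed in MB/s into the bytes-per-second integer
# that `kafka reassign-partitions --throttle` expects (unit confirmed above).
# mb_to_throttle is a hypothetical helper name, not a Kafka CLI command.
mb_to_throttle() {
    echo $(( $1 * 1000000 ))
}

mb_to_throttle 50    # 50 MB/s, the conservative starting point suggested above
```

This makes it harder to drop or add a zero when ramping the throttle up or down during a reassignment.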
[10:26:41] 10Data-Platform-SRE, 10Dumps-Generation, 10cloud-services-team, 10Patch-For-Review: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (10BTullis) Here is the source directory on datadumps1006. It is owned by `dumpsgen|dumpsgen` and... [10:27:47] !log roll-restarting hadoop namenodes to pick up new heap settings. [10:27:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:37:13] btullis: re https://gerrit.wikimedia.org/r/c/operations/puppet/+/961698 (sorry didn't review it in time) - not sure if 4GB are enough, but we can always bump (maybe 10/12 would give extra headroom, assuming we have space). The alert is kinda outdated I think, we can also review/drop it in case [10:37:39] (I checked the upstream numbers and they don't match what we already have IIUC) [10:38:51] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [10:39:18] btullis elukey: here's an example of how to evacuate data away from a set of brokers with minimal impact on the producer/consumers https://phabricator.wikimedia.org/P52716 [10:41:39] the idea is to do ^ this ^ on batches of topics, with a throttle [10:42:53] makes sense yes [10:45:51] elukey: We don't have enough headroom on the hadoop masters' total RAM for me to feel comfortable adding more than 4GB. The O/S cache is already squeezed down to about 16 GB https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=an-master1001&var-datasource=thanos&var-cluster=analytics&viewPanel=4 [10:50:14] brouberol: Yes, that plan looks good to me too. 
Should I take it that topicmappr benefits from being in a screen/tmux session? It doesn't return until the reassignment is complete? [10:52:01] Damn! The failback from an-master1002 to an-master1001 failed again. [10:52:05] https://www.irccloud.com/pastebin/9n5yW12h/ [10:54:58] !log sudo systemctl start hadoop-hdfs-namenode.service on an-master1001 after cookbook failback failure [10:56:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:58:24] btullis Topicmappr generates its output in a couple of secs, and the reassign partition command as well. The actual work of moving data around is then done by Kafka itself. So no real need to run in a screen [10:58:51] brouberol: ack, thanks. [11:01:37] Basically it’s in charge of computing the smartest plan possible given a set of constraints, but only that. [11:07:36] Oh I see, so you just use `kafka reassign-partitions --reassignment-json-file --execute` to fire and forget? [11:07:51] Then the `--verify` to check on progress? [11:08:22] Exactly [11:08:46] 😎 [11:16:00] 10Data-Platform-SRE, 10Research, 10WMDE-TechWish-Maintenance-2023: Publish dump scraper reports - https://phabricator.wikimedia.org/T341751 (10awight) a:05awightβ†’03None [11:22:24] 10Data-Platform-SRE, 10Discovery-Search (Current work): Ensure mjolnir can work on Python 3.9 or later - https://phabricator.wikimedia.org/T346373 (10BTullis) >>! In T346373#9200159, @EBernhardson wrote: > This needs an updated version of conda, attempts to update the python version currently result in the dep... [11:30:22] 10Data-Platform-SRE, 10Dumps-Generation, 10cloud-services-team, 10Patch-For-Review: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (10BTullis) Oh, that didn't work. The owner and group of `clouddumps1002:/srv/dumps/xmldatadumps/p... 
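The fire-and-forget flow agreed above can be sketched as a dry run that only prints the commands rather than executing them. The plan filename is an illustrative assumption, and `kafka` stands for whatever reassign-partitions wrapper is available on the broker hosts:

```shell
# Dry-run sketch of the reassignment flow discussed above: topicmappr emits
# the plan in seconds, Kafka itself moves the data after --execute, and
# --verify is polled for progress (it also lifts the throttle once the
# reassignment completes). Filenames and the throttle value are illustrative.
PLAN="reassignment-phase0.json"   # hypothetical output of topicmappr rebuild
THROTTLE=50000000                 # bytes/sec

run() { echo "would run: $*"; }   # print instead of executing

run kafka reassign-partitions --reassignment-json-file "$PLAN" --throttle "$THROTTLE" --execute
run kafka reassign-partitions --reassignment-json-file "$PLAN" --verify
```

Since `--execute` returns immediately and the brokers do the copying in the background, neither step needs a screen/tmux session, matching the answer above.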
[11:40:51] (HdfsTotalFilesHeap) resolved: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [13:07:28] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ... [13:07:28] mw_page_content_change_enrich in eqiad is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [13:20:03] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10EventStreams, 10Event-Platform: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (10elukey) In https://horizon.wikimedia.org/project/instances/beed7e0d-4e7f-446f-a73c-60dce7ecff4f/ I see the config fo... [13:22:58] (SystemdUnitFailed) firing: (2) kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service Failed on kafka-jumbo1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:27:57] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10EventStreams, 10Event-Platform: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (10elukey) First attempt not good: ` Sep 28 13:26:01 deployment-eventstreams-2 docker-eventstreams[1093001]: {"name":"... 
[13:29:36] 10Data-Platform-SRE: Troubleshoot mw-page-content-change-enrich and flink-operator - https://phabricator.wikimedia.org/T347521 (10tchin) @bking Gabriele is currently on sick leave but yes let's try incrementing the helm chart version [13:33:58] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) This is the cluster initial state.{F37829340} [13:50:25] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) ` brouberol@kafka-jumbo1010:~/topicmappr$ topicmappr rebuild --topics '^[a-d].*$' --brokers '1007,1008,1009,1010,1011,1012,1013,1014,1015' --skip-no-ops --optimize-leadership --phased-reassignment --... [13:55:28] !log started the evacuation of a subset of topics away from kafka-10[01-06].eqiad.wmnet T336044 [13:59:11] brouberol: if anything blows up and you are afk, where do we find rollback jsons ? [14:00:55] I have committed them in https://phabricator.wikimedia.org/T336044#9206801 [14:01:10] this ^ is the cluster initial state [14:01:46] I have massaged it so that you can directly feed it to kafka reassign-partitions [14:03:43] ah, I forgot that kafka sees reassignments as under-replicated partitions. Should i put a downtime on the cluster? [14:06:57] (MediawikiPageContentChangeEnrichJobManagerNotRunning) resolved: ... 
[14:06:57] mw_page_content_change_enrich in eqiad is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [14:11:20] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) I started with a 50MB/s overall throttle, as a conservative value: ` kafka reassign-partitions --reassignment-json-file out-files/reassignment-a-to-d-phase0.json --throttle 5000000 --execute ` [14:14:25] brouberol, you can put a silence on this specific alert from alerts.wikimedia.org [14:14:29] https://usercontent.irccloud-cdn.com/file/H5bFSq8N/image.png [14:15:02] https://wikitech.wikimedia.org/wiki/Alertmanager#Silences_&_acknowledgements [14:16:29] done, thanks! [14:22:03] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) I bumped the throttle a bit as the cluster isn't displaying any strain: ` kafka reassign-partitions --reassignment-json-file out-files/reassignment-a-to-d-phase0.json --throttle 8000000 --execute ` [14:26:56] 10Data-Platform-SRE: Troubleshoot mw-page-content-change-enrich and flink-operator - https://phabricator.wikimedia.org/T347521 (10bking) 05Openβ†’03Resolved Per IRC conversation with @dcausse , the application was in a partially-deployed state (he was able to find this via `kubectl get networkpolicy`). Destro... [14:33:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ... 
[14:33:27] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [14:33:27] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) firing: ... [14:33:33] The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning [14:39:06] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [14:45:03] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) ` kafka reassign-partitions --reassignment-json-file out-files/reassignment-a-to-d-phase0.json --throttle 13000000 --execute ` [15:03:49] heya btullis - any idea on the corrupt block report alerts we've seen today? [15:05:54] Hi joal - I'm assuming it's related to the failover of the namenodes, from an-master1001 to an-master1002. [15:06:25] it would indeed make sense :) Today's failover was for restart for memory bumps, right? [15:06:29] btullis: --^ [15:06:42] The failback failed again, so I'm waiting for a quiet time to attempt the failback again. 
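For context on the heap bump being rolled out above: a widely quoted Hadoop operations rule of thumb (an outside reference, not something stated in this conversation) is roughly 1 GB of NameNode heap per million filesystem objects (files plus blocks). A back-of-the-envelope check, with a made-up object count:

```shell
# Estimate NameNode heap demand from the number of filesystem objects,
# using the ~1 GB per million objects rule of thumb. The object count
# below is a hypothetical example, not a measurement from this cluster.
objects=40000000   # hypothetical total of files + blocks
echo "estimated NameNode heap needed: ~$(( objects / 1000000 )) GB"
```

This is why both bumping the heap and reducing the total file count (as discussed below) relieve the HdfsTotalFilesHeap alert.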
[15:07:05] :( [15:07:21] Yes that's right. Adding another 4GB of heap. I would like to have added more, but we don't have the free RAM in the namenodes. [15:07:25] We should find a way [15:07:59] (not about RAM, about restarts) [15:08:08] I wish we could reduce the file count [15:08:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) resolved: ... [15:08:27] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [15:08:27] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) resolved: ... [15:08:33] The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning [15:08:50] Yes, reducing the number of files would be good. [15:09:12] wow - flink errors as well :( [15:09:17] mwarf [15:09:26] I was thinking of waiting until :50 past the hour for the failback, but maybe I should attempt it sooner. [15:09:56] g.modena is out today as well :( [15:10:13] Normally :50 is a good time - but in the afternoon there is more user usage as well [15:10:30] and, there is more and more cluster usage at large [15:10:34] We need to find a way [15:10:46] possibly set the system in read-only mode for the restart time [15:12:15] We have two new namenode servers being installed now for refresh of an-master100[12] - but sadly they have the same amount of RAM - 128 GB each. 
[15:13:39] I feel I should perhaps have bumped the specs at the time of ordering, but usage wasn't growing as quickly back when I had to confirm the order as it has been recently. [15:13:40] could we ask for a bump (to 256 for instance) before we put them in service? [15:13:55] I will ask. [15:15:22] thanks so much btullis [15:20:28] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) ` kafka reassign-partitions --reassignment-json-file out-files/reassignment-a-to-d-phase0.json --throttle 18000000 --execute ` [15:32:16] joal: https://phabricator.wikimedia.org/T342291#9207501 [15:32:30] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10BTullis) @RobH I'm sorry to have to be a pain, but is there any chance that we can increase the RAM in these two servers, before they come into ser... [15:33:14] Thanks so much again btullis <3 [15:33:44] joal: You're suggesting setting it into safe mode before the failback? https://phabricator.wikimedia.org/T342291#9207501 What impact is this likely to have on running jobs? [15:34:01] btullis: it'll make jobs fail : ( [15:34:39] Yeah, thought so. I'd like to try one more failback without that first, if I can. [15:47:42] Attempting the failback any minute now... [15:50:45] !log failed back namenode services from an-master1002 to an-master1001 [15:50:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:50:51] https://www.irccloud.com/pastebin/g90xsXLO/ [15:51:05] Phew! [15:54:25] I'll be out of here in about 10 minutes. The reassignment on kafka-jumbo.eqiad is still ongoing, with a throttle currently set at 250MB/s. 
I'll extend the silence until tomorrow 10am CEST (a bit after I'm back online) [15:54:44] \o/ btullis [15:56:11] Should any issue occur, feel free to reduce the throttle by running kafka reassign-partitions --reassignment-json-file /home/brouberol/topicmappr/out-files/reassignment-a-to-d-phase0.json --throttle 25000000 --execute on kafka-jumbo1010.eqiad.wmnet [15:57:33] 10Data-Platform-SRE, 10Wikidata-Query-Service: Document process for getting JNL files/consider automation - https://phabricator.wikimedia.org/T347605 (10bking) [15:57:43] (reducing the throttle value as necessary) [15:58:05] 10Data-Platform-SRE, 10Wikidata-Query-Service: Document process for getting JNL files/consider automation - https://phabricator.wikimedia.org/T347605 (10bking) [16:00:58] 10Data-Platform-SRE, 10Wikidata-Query-Service: Document process for getting JNL files/consider automation - https://phabricator.wikimedia.org/T347605 (10bking) [16:09:45] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10RobH) [16:12:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ... [16:12:27] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [16:13:27] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) firing: ... 
[16:13:27] The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning [16:19:57] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10RobH) >>! In T342291#9207501, @BTullis wrote: > @RobH I'm sorry to have to be a pain, but is there any chance that we can increase the RAM in these... [16:28:27] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) resolved: ... [16:28:27] The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning [16:32:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) resolved: ... 
[16:32:27] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [16:35:40] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Data Pipelines: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (10BTullis) 05Open→03Resolved I have removed all previous conda environments on stat1009 and now i... [16:38:19] !log rebooting eventlog1003 for T344671 [16:38:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:42:43] (SystemdUnitFailed) firing: (3) nagios-nrpe-server.service Failed on eventlog1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:46:29] I've reduced the throttle back to 50MB/s as mw-page-content-change-enrich is suffering [16:47:43] (SystemdUnitFailed) firing: (3) nagios-nrpe-server.service Failed on eventlog1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:48:45] brouberol: Many thanks. [16:49:46] ^ these eventlog1003 alerts should go away by themselves. The service is up and running again. [16:49:51] (HdfsFSImageAge) firing: The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. 
- https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [16:50:26] ^ This Hadoop fsimage alert should also go away by itself. We tend to get these after failovers. [16:54:51] (HdfsFSImageAge) firing: (2) The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [16:56:20] 10Data-Engineering, 10Data-Platform-SRE, 10Data Products, 10Wikidata, 10Wikidata-Query-Service: Publish WDQS JNL files to dumps.wikimedia.org - https://phabricator.wikimedia.org/T344905 (10bking) [16:58:01] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Document process for getting JNL files/consider automation - https://phabricator.wikimedia.org/T347605 (10bking) [16:59:51] (HdfsFSImageAge) resolved: (2) The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [17:14:57] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e... [17:23:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ... 
[17:23:27] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [17:24:27] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) firing: ... [17:24:27] The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning [17:26:50] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: mw-page-content-change-enrich not checkpoint - https://phabricator.wikimedia.org/T347615 (10tchin) [17:27:11] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: mw-page-content-change-enrich not checkpointing - https://phabricator.wikimedia.org/T347615 (10tchin) [17:30:13] 10Data-Engineering, 10MediaWiki-extensions-EventLogging, 10Platform Team Initiatives (New Hook System): Update EventLogging to use the new HookContainer/HookRunner system - https://phabricator.wikimedia.org/T346540 (10Umherirrender) a:05Umherirrenderβ†’03None [17:43:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) resolved: ... 
[17:43:27] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [17:44:27] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) resolved: ... [17:44:27] The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning [17:59:37] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform, 10Patch-For-Review: mw-page-content-change-enrich not checkpointing - https://phabricator.wikimedia.org/T347615 (10tchin) [18:04:46] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform, 10Patch-For-Review: mw-page-content-change-enrich not checkpointing - https://phabricator.wikimedia.org/T347615 (10tchin) Unaligned checkpoints didn't work. Maybe it's because of data being moved around to new brok... [18:10:01] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... [18:11:05] 10Data-Platform-SRE: Write new partman recipe for cloudelastic - https://phabricator.wikimedia.org/T342463 (10bking) The new recipe still fails with the message `Failed to load ldlinux.c32`. 
That doesn't sound like a partitioning problem. Will attempt a firmware update and get back. [18:24:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ... [18:24:27] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [18:26:36] I just tried to silence this --^ if it fires again it's because I've not done it correctly :S (lack of experience...) - tchin, I'll let you triple-check please :) [18:26:39] 10Data-Engineering, 10Data-Platform-SRE, 10Data Products, 10Wikidata, 10Wikidata-Query-Service: Publish WDQS JNL files to dumps.wikimedia.org - https://phabricator.wikimedia.org/T344905 (10bking) @dr0ptp4kt and I were looking at this today and it occurred to me that the JNL file is uncompressed. Thus,... [18:39:06] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [19:04:43] 10Data-Platform-SRE: Refactor sre.wdqs.data-transfer to use new spicerack class API - https://phabricator.wikimedia.org/T347624 (10RKemper) [19:24:27] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e...
[19:24:36] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... [19:28:14] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e... [20:02:55] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... [20:03:12] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e... [20:03:19] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... [20:47:51] (HdfsRpcQueueLength) firing: RPC call queue length on the analytics-hadoop cluster is too high. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength [20:47:58] (SystemdUnitFailed) firing: (2) kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service Failed on kafka-jumbo1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:52:51] (HdfsRpcQueueLength) resolved: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength [20:55:30] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e... [20:55:36] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... [20:59:08] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e... 
[21:02:58] (03PS12) 10Aqu: Create a job to dump XML/SQL MW history files to HDFS [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) [21:14:05] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... [21:53:00] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e... [22:01:41] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10bking) Hello DC Ops, I'm still getting PXE boot failures on `cloudelastic1007` . I've upgraded/downgraded to the... [22:01:51] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10bking) a:05bkingβ†’03None [22:02:54] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... 
[22:10:35] (03PS10) 10Clare Ming: Add Metrics Platform fragments by platform only [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) [22:11:06] (03PS15) 10Clare Ming: Add analytics/metrics_platform/{app,web}/{click,view} schemas [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: 10Phuedx) [22:39:06] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks