[04:15:51] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [04:21:40] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (10WolfgangFahl) What a great application of Postel's law https://en.wikipedia.org/wiki/Robustness_principle [05:17:58] (SystemdUnitFailed) firing: (3) monitor_refine_eventlogging_analytics.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:52:18] 10Data-Platform-SRE, 10Research, 10WMDE-TechWish-Maintenance-2023: Publish dump scraper reports - https://phabricator.wikimedia.org/T341751 (10awight) [07:52:45] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (10dcausse) 05Resolvedβ†’03Open @Hannah_Bast thanks for making such a change! I did a quick test locally a... [07:52:50] 10Data-Platform-SRE, 10Research, 10WMDE-TechWish-Maintenance-2023: Publish dump scraper reports - https://phabricator.wikimedia.org/T341751 (10awight) The data and metadata are published--the final step is to announce on the research-l mailing list. 
[08:07:11] 10Data-Platform-SRE, 10Cloud-VPS, 10cloud-services-team, 10User-aborrero: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10taavi) Tagging #data-platform-sre to get answers to Andrew's questions in T346948#9184440. [08:16:06] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [08:33:15] Morning all. I'm going to make that patch to bump the HDFS namenode heap allocation, as we discussed a couple of days ago. [09:17:58] (SystemdUnitFailed) firing: (3) monitor_refine_eventlogging_analytics.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:21:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:22:43] (SystemdUnitFailed) firing: (3) monitor_refine_eventlogging_analytics.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:36:10] I have finished the write-up of yesterday's incident on Wikitech and set it to in-review: https://wikitech.wikimedia.org/wiki/Incidents/2023-09-27_Kafka-jumbo_mirror-makers - Please feel free to review and/or amend. [09:38:13] I'm looking for someone who'd be willing to pair on the reassignment of kafka partitions away from kafka-jumbo10[01->06].eqiad.wmnet. 
The idea would be to use topicmappr to compute a reassignment plan, and run it in several steps, to avoid overloading the servers [09:41:25] One thing I'm especially interested in knowing is whether we usually reassign as fast as possible, or do we use a throttle. And if so, what throttle value do we commonly use? [09:41:35] I'd be happy to pair, but I've never done it before so I would likely be learning from you. You're welcome to seek a more qualified volunteer. There is some prior art here, if that helps. https://phabricator.wikimedia.org/T341558 [09:43:33] I believe that we tend to err on the side of caution and exercise patience. [09:44:03] Also: https://gitlab.wikimedia.org/elukey/kafka_main_rebalance/-/tree/main/main-codfw [09:44:44] Thanks! elukey, would you be interested in joining as well? [09:45:07] (given that you were involved in https://phabricator.wikimedia.org/T341558, as btullis pointed out) [09:45:39] brouberol: two suggestions: 1) use a gitlab repo to commit the plan and the rollbacks, so in case something happens mid-way we know how to pick up 2) I used some throttling IIRC, but the idea was to start from something low and maybe increase once you see that the cluster is fine [09:47:30] gotcha! I'd be interested in knowing what we consider to be "low" here. I remember some d2 AWS instances that would cap at 40MB/s for example, so low would be 10MB/s, or 500MB/s on some GCP instances with some beefy network links [10:00:31] 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Degraded RAID on dbstore1005 - https://phabricator.wikimedia.org/T347449 (10BTullis) 05Openβ†’03Resolved Thanks @Jclark-ctr - but everything is OK now with this server so no further action is required at the moment. It looks like the RAID controller must have had a... 
[10:02:34] 10Data-Engineering, 10Data-Platform-SRE: [INCIDENT] kafka-jumbo mirrormaker from main-eqiad crashes associated with RecordTooLargeException errors - https://phabricator.wikimedia.org/T347481 (10BTullis) 05Open→03Resolved I've set the Wikitech page status to in-review and shared the link. There are few acti... [10:06:37] brouberol: we have 10G cards on all jumbo nodes so I don't expect issues, but we could cap to something like 50MB/s as a starter [10:07:08] 10Data-Engineering, 10Data-Platform-SRE, 10SRE Observability: Install a Prometheus connector for Presto, pointed at thanos-query - https://phabricator.wikimedia.org/T347430 (10JAllemandou) Adding this to our radar as well, to keep an eye when we start querying. [10:07:35] 10Data-Engineering, 10Data-Platform-SRE, 10SRE Observability, 10Data Engineering and Event Platform Team (Sprint 2): Install a Prometheus connector for Presto, pointed at thanos-query - https://phabricator.wikimedia.org/T347430 (10JAllemandou) [10:16:09] elukey: thanks [10:17:07] Is this 50 MB/s here? https://gitlab.wikimedia.org/elukey/kafka_main_rebalance/-/blob/main/main-codfw/executor.sh#L7 [10:18:32] btullis: I don't recall the unit for throttle, but kafka main has 1G interfaces, so I went probably more conservative [10:18:53] it is bytes [10:19:19] so I used 50MBps indeed [10:19:30] probably a bit too much, but it worked nicely [10:20:00] Ah, I didn't know about the 1G cards in kafka-main. Thanks elukey. [10:20:05] It looks like we might need to make some updates to this section of the docs while we're working on this ticket: https://wikitech.wikimedia.org/wiki/Kafka/Administration#Rebalance_topic_partitions_to_new_brokers [10:21:45] I'm planning to run an `sre.hadoop.roll-restart-masters` cookbook to pick up the new heap settings, unless anyone suggests otherwise. 
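As established above, the reassignment throttle is given to `kafka reassign-partitions` in bytes per second, which makes the values easy to misread at a glance. A minimal shell helper for the conversion; `mb_to_throttle` is our own illustrative name, not part of any Kafka tooling:

```shell
# Convert a throttle expressed in MB/s into the bytes-per-second integer
# that `kafka reassign-partitions --throttle` expects (unit confirmed above).
# mb_to_throttle is a hypothetical helper name, not a Kafka CLI command.
mb_to_throttle() {
    echo $(( $1 * 1000000 ))
}

mb_to_throttle 50    # 50 MB/s, the conservative starting point suggested above
```

This makes it harder to drop or add a zero when ramping the throttle up or down during a reassignment.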
[10:26:41] 10Data-Platform-SRE, 10Dumps-Generation, 10cloud-services-team, 10Patch-For-Review: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (10BTullis) Here is the source directory on datadumps1006. It is owned by `dumpsgen|dumpsgen` and... [10:27:47] !log roll-restarting hadoop namenodes to pick up new heap settings. [10:27:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:37:13] btullis: re https://gerrit.wikimedia.org/r/c/operations/puppet/+/961698 (sorry didn't review it in time) - not sure if 4GB are enough, but we can always bump (maybe 10/12 would give extra headroom, assuming we have space). The alert is kinda outdated I think, we can also review/drop it in case [10:37:39] (I checked the upstream numbers and they don't match what we already have IIUC) [10:38:51] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [10:39:18] btullis elukey: here's an example of how to evacuate data away from a set of brokers with minimal impact on the producer/consumers https://phabricator.wikimedia.org/P52716 [10:41:39] the idea is to do ^ this ^ on batches of topics, with a throttle [10:42:53] makes sense yes [10:45:51] elukey: We don't have enough headroom on the hadoop masters' total RAM for me to feel comfortable adding more than 4GB. The O/S cache is already squeezed down to about 16 GB https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=an-master1001&var-datasource=thanos&var-cluster=analytics&viewPanel=4 [10:50:14] brouberol: Yes, that plan looks good to me too. 
Should I take it that topicmappr benefits from being in a screen/tmux session? It doesn't return until the reassignment is complete? [10:52:01] Damn! The failback from an-master1002 to an-master1001 failed again. [10:52:05] https://www.irccloud.com/pastebin/9n5yW12h/ [10:54:58] !log sudo systemctl start hadoop-hdfs-namenode.service on an-master1001 after cookbook failback failure [10:56:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:58:24] btullis Topicmappr generates its output in a couple of secs, and the reassign partition command as well. The actual work of moving data around is then done by Kafka itself. So no real need to run in a screen [10:58:51] brouberol: ack, thanks. [11:01:37] Basically it’s in charge of computing the smartest plan possible given a set of constraints, but only that. [11:07:36] Oh I see, so you just use `kafka reassign-partitions --reassignment-json-file --execute` to fire and forget? [11:07:51] Then the `--verify` to check on progress? [11:08:22] Exactly [11:08:46] 😎 [11:16:00] 10Data-Platform-SRE, 10Research, 10WMDE-TechWish-Maintenance-2023: Publish dump scraper reports - https://phabricator.wikimedia.org/T341751 (10awight) a:05awightβ†’03None [11:22:24] 10Data-Platform-SRE, 10Discovery-Search (Current work): Ensure mjolnir can work on Python 3.9 or later - https://phabricator.wikimedia.org/T346373 (10BTullis) >>! In T346373#9200159, @EBernhardson wrote: > This needs an updated version of conda, attempts to update the python version currently result in the dep... [11:30:22] 10Data-Platform-SRE, 10Dumps-Generation, 10cloud-services-team, 10Patch-For-Review: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (10BTullis) Oh, that didn't work. The owner and group of `clouddumps1002:/srv/dumps/xmldatadumps/p... 
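The fire-and-forget flow agreed above can be sketched as a dry run that only prints the commands rather than executing them. The plan filename is an illustrative assumption, and `kafka` stands for whatever reassign-partitions wrapper is available on the broker hosts:

```shell
# Dry-run sketch of the reassignment flow discussed above: topicmappr emits
# the plan in seconds, Kafka itself moves the data after --execute, and
# --verify is polled for progress (it also lifts the throttle once the
# reassignment completes). Filenames and the throttle value are illustrative.
PLAN="reassignment-phase0.json"   # hypothetical output of topicmappr rebuild
THROTTLE=50000000                 # bytes/sec

run() { echo "would run: $*"; }   # print instead of executing

run kafka reassign-partitions --reassignment-json-file "$PLAN" --throttle "$THROTTLE" --execute
run kafka reassign-partitions --reassignment-json-file "$PLAN" --verify
```

Since `--execute` returns immediately and the brokers do the copying in the background, neither step needs a screen/tmux session, matching the answer above.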
[11:40:51] (HdfsTotalFilesHeap) resolved: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [13:07:28] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ... [13:07:28] mw_page_content_change_enrich in eqiad is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [13:20:03] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10EventStreams, 10Event-Platform: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (10elukey) In https://horizon.wikimedia.org/project/instances/beed7e0d-4e7f-446f-a73c-60dce7ecff4f/ I see the config fo... [13:22:58] (SystemdUnitFailed) firing: (2) kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service Failed on kafka-jumbo1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:27:57] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10EventStreams, 10Event-Platform: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (10elukey) First attempt not good: ` Sep 28 13:26:01 deployment-eventstreams-2 docker-eventstreams[1093001]: {"name":"... 
[13:29:36] 10Data-Platform-SRE: Troubleshoot mw-page-content-change-enrich and flink-operator - https://phabricator.wikimedia.org/T347521 (10tchin) @bking Gabriele is currently on sick leave but yes let's try incrementing the helm chart version [13:33:58] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) This is the cluster initial state.{F37829340} [13:50:25] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) ` brouberol@kafka-jumbo1010:~/topicmappr$ topicmappr rebuild --topics '^[a-d].*$' --brokers '1007,1008,1009,1010,1011,1012,1013,1014,1015' --skip-no-ops --optimize-leadership --phased-reassignment --... [13:55:28] !log started the evacuation of a subset of topics away from kafka-10[01-06].eqiad.wmnet T336044 [13:59:11] brouberol: if anything blows up and you are afk, where do we find rollback jsons ? [14:00:55] I have committed them in https://phabricator.wikimedia.org/T336044#9206801 [14:01:10] this ^ is the cluster initial state [14:01:46] I have massaged it so that you can directly feed it to kafka reassign-partitions [14:03:43] ah, I forgot that kafka sees reassignments as under-replicated partitions. Should i put a downtime on the cluster? [14:06:57] (MediawikiPageContentChangeEnrichJobManagerNotRunning) resolved: ... 
[14:06:57] mw_page_content_change_enrich in eqiad is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [14:11:20] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) I started with a 50MB/s overall throttle, as a conservative value: ` kafka reassign-partitions --reassignment-json-file out-files/reassignment-a-to-d-phase0.json --throttle 5000000 --execute ` [14:14:25] brouberol, you can put a silence on this specific alert from alerts.wikimedia.org [14:14:29] https://usercontent.irccloud-cdn.com/file/H5bFSq8N/image.png [14:15:02] https://wikitech.wikimedia.org/wiki/Alertmanager#Silences_&_acknowledgements [14:16:29] done, thanks! [14:22:03] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) I bumped the throttle a bit as the cluster isn't displaying any strain: ` kafka reassign-partitions --reassignment-json-file out-files/reassignment-a-to-d-phase0.json --throttle 8000000 --execute ` [14:26:56] 10Data-Platform-SRE: Troubleshoot mw-page-content-change-enrich and flink-operator - https://phabricator.wikimedia.org/T347521 (10bking) 05Openβ†’03Resolved Per IRC conversation with @dcausse , the application was in a partially-deployed state (he was able to find this via `kubectl get networkpolicy`). Destro... [14:33:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ... 
[14:33:27] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [14:33:27] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) firing: ... [14:33:33] The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning [14:39:06] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [14:45:03] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) ` kafka reassign-partitions --reassignment-json-file out-files/reassignment-a-to-d-phase0.json --throttle 13000000 --execute ` [15:03:49] heya btullis - any idea on the corrupt block report alerts we've seen today? [15:05:54] Hi joal - I'm assuming it's related to the failover of the namenodes, from an-master1001 to an-master1002. [15:06:25] it would indeed make sense :) Today's failover was for restart for memory bumps, right? [15:06:29] btullis: --^ [15:06:42] The failback failed again, so I'm waiting for a quiet time to attempt the failback again. 
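For context on the heap bump being rolled out above: a widely quoted Hadoop operations rule of thumb (an outside reference, not something stated in this conversation) is roughly 1 GB of NameNode heap per million filesystem objects (files plus blocks). A back-of-the-envelope check, with a made-up object count:

```shell
# Estimate NameNode heap demand from the number of filesystem objects,
# using the ~1 GB per million objects rule of thumb. The object count
# below is a hypothetical example, not a measurement from this cluster.
objects=40000000   # hypothetical total of files + blocks
echo "estimated NameNode heap needed: ~$(( objects / 1000000 )) GB"
```

This is why both bumping the heap and reducing the total file count (as discussed below) relieve the HdfsTotalFilesHeap alert.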
[15:07:05] :( [15:07:21] Yes that's right. Adding another 4GB of heap. I would like to have added more, but we don't have the free RAM in the namenodes. [15:07:25] We should find a way [15:07:59] (not about RAM, about restarts) [15:08:08] I wish we could reduce the file count [15:08:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) resolved: ... [15:08:27] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [15:08:27] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) resolved: ... [15:08:33] The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning [15:08:50] Yes, reducing the number of files would be good. [15:09:12] wow - flink errors as well :( [15:09:17] mwarf [15:09:26] I was thinking of waiting until :50 past the hour for the failback, but maybe I should attempt it sooner. [15:09:56] g.modena is out today as well :( [15:10:13] Normally :50 is a good time - but in the afternoon there is more user usage as well [15:10:30] and, there is more and more cluster usage at large [15:10:34] We need to find a way [15:10:46] possibly set the system in read-only mode for the restart time [15:12:15] We have two new namenode servers being installed now for refresh of an-master100[12] - but sadly they have the same amount of RAM - 128 GB each. 
[15:13:39] I feel I should perhaps have bumped the specs at the time of ordering, but usage wasn't growing as quickly back when I had to confirm the order as it has been recently. [15:13:40] could we ask for a bump (to 256 for instance) before we put them in service? [15:13:55] I will ask. [15:15:22] thanks so much btullis [15:20:28] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) ` kafka reassign-partitions --reassignment-json-file out-files/reassignment-a-to-d-phase0.json --throttle 18000000 --execute ` [15:32:16] joal: https://phabricator.wikimedia.org/T342291#9207501 [15:32:30] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10BTullis) @RobH I'm sorry to have to be a pain, but is there any chance that we can increase the RAM in these two servers, before they come into ser... [15:33:14] Thanks so much again btullis <3 [15:33:44] joal: You're suggesting setting it into safe mode before the failback? https://phabricator.wikimedia.org/T342291#9207501 What impact is this likely to have on running jobs? [15:34:01] btullis: it'll make jobs fail : ( [15:34:39] Yeah, thought so. I'd like to try one more failback without that first, if I can. [15:47:42] Attempting the failback any minute now... [15:50:45] !log failed back namenode services from an-master1002 to an-master1001 [15:50:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:50:51] https://www.irccloud.com/pastebin/g90xsXLO/ [15:51:05] Phew! [15:54:25] I'll be out of here in about 10 minutes. The reassignment on kafka-jumbo.eqiad is still ongoing, with a throttle currently set at 250MB/s. 
I'll extend the silence until tomorrow 10am CEST (a bit after I'm back online) [15:54:44] \o/ btullis [15:56:11] Should any issue occur, feel free to reduce the throttle by running kafka reassign-partitions --reassignment-json-file /home/brouberol/topicmappr/out-files/reassignment-a-to-d-phase0.json --throttle 25000000 --execute on kafka-jumbo1010.eqiad.wmnet [15:57:33] 10Data-Platform-SRE, 10Wikidata-Query-Service: Document process for getting JNL files/consider automation - https://phabricator.wikimedia.org/T347605 (10bking) [15:57:43] (reducing the throttle value as necessary) [15:58:05] 10Data-Platform-SRE, 10Wikidata-Query-Service: Document process for getting JNL files/consider automation - https://phabricator.wikimedia.org/T347605 (10bking) [16:00:58] 10Data-Platform-SRE, 10Wikidata-Query-Service: Document process for getting JNL files/consider automation - https://phabricator.wikimedia.org/T347605 (10bking) [16:09:45] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10RobH) [16:12:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ... [16:12:27] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [16:13:27] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) firing: ... 
[16:13:27] The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning [16:19:57] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10RobH) >>! In T342291#9207501, @BTullis wrote: > @RobH I'm sorry to have to be a pain, but is there any chance that we can increase the RAM in these... [16:28:27] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) resolved: ... [16:28:27] The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning [16:32:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) resolved: ... 
[16:32:27] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [16:35:40] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Data Pipelines: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (10BTullis) 05Open→03Resolved I have removed all previous conda environments on stat1009 and now i... [16:38:19] !log rebooting eventlog1003 for T344671 [16:38:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:42:43] (SystemdUnitFailed) firing: (3) nagios-nrpe-server.service Failed on eventlog1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:46:29] I've reduced the throttle back to 50MB/s as mw-page-content-change-enrich is suffering [16:47:43] (SystemdUnitFailed) firing: (3) nagios-nrpe-server.service Failed on eventlog1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:48:45] brouberol: Many thanks. [16:49:46] ^ these eventlog1003 alerts should go away by themselves. The service is up and running again. [16:49:51] (HdfsFSImageAge) firing: The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. 
- https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [16:50:26] ^ This Hadoop fsimage alert should also go away by itself. We tend to get these after failovers. [16:54:51] (HdfsFSImageAge) firing: (2) The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [16:56:20] 10Data-Engineering, 10Data-Platform-SRE, 10Data Products, 10Wikidata, 10Wikidata-Query-Service: Publish WDQS JNL files to dumps.wikimedia.org - https://phabricator.wikimedia.org/T344905 (10bking) [16:58:01] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Document process for getting JNL files/consider automation - https://phabricator.wikimedia.org/T347605 (10bking) [16:59:51] (HdfsFSImageAge) resolved: (2) The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [17:14:57] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e... [17:23:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ... 
[17:23:27] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [17:24:27] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) firing: ... [17:24:27] The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning [17:26:50] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: mw-page-content-change-enrich not checkpoint - https://phabricator.wikimedia.org/T347615 (10tchin) [17:27:11] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: mw-page-content-change-enrich not checkpointing - https://phabricator.wikimedia.org/T347615 (10tchin) [17:30:13] 10Data-Engineering, 10MediaWiki-extensions-EventLogging, 10Platform Team Initiatives (New Hook System): Update EventLogging to use the new HookContainer/HookRunner system - https://phabricator.wikimedia.org/T346540 (10Umherirrender) a:05Umherirrenderβ†’03None [17:43:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) resolved: ... 
[17:43:27] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [17:44:27] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) resolved: ... [17:44:27] The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning [17:59:37] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform, 10Patch-For-Review: mw-page-content-change-enrich not checkpointing - https://phabricator.wikimedia.org/T347615 (10tchin) [18:04:46] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform, 10Patch-For-Review: mw-page-content-change-enrich not checkpointing - https://phabricator.wikimedia.org/T347615 (10tchin) Unaligned checkpoints didn't work. Maybe it's because of data being moved around to new brok... [18:10:01] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... [18:11:05] 10Data-Platform-SRE: Write new partman recipe for cloudelastic - https://phabricator.wikimedia.org/T342463 (10bking) The new recipe still fails with the message `Failed to load ldlinux.c32`. 
That doesn't sound like a partitioning problem. Will attempt a firmware update and get back. [18:24:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ... [18:24:27] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [18:26:36] I just tried to silence this --^ if it fires again it's because I've not done it correctly :S (lack of experience...) - tchin, I'll let you triple-check please :) [18:26:39] 10Data-Engineering, 10Data-Platform-SRE, 10Data Products, 10Wikidata, 10Wikidata-Query-Service: Publish WDQS JNL files to dumps.wikimedia.org - https://phabricator.wikimedia.org/T344905 (10bking) @dr0ptp4kt and I were looking at this today and it occurred to me that the JNL file is uncompressed. Thus,... [18:39:06] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [19:04:43] 10Data-Platform-SRE: Refactor sre.wdqs.data-transfer to use new spicerack class API - https://phabricator.wikimedia.org/T347624 (10RKemper) [19:24:27] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e...
[19:24:36] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... [19:28:14] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e... [20:02:55] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... [20:03:12] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e... [20:03:19] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... [20:47:51] (HdfsRpcQueueLength) firing: RPC call queue length on the analytics-hadoop cluster is too high. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength [20:47:58] (SystemdUnitFailed) firing: (2) kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service Failed on kafka-jumbo1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:52:51] (HdfsRpcQueueLength) resolved: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength [20:55:30] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e... [20:55:36] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... [20:59:08] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e... 
[21:02:58] (03PS12) 10Aqu: Create a job to dump XML/SQL MW history files to HDFS [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) [21:14:05] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... [21:53:00] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e... [22:01:41] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10bking) Hello DC Ops, I'm still getting PXE boot failures on `cloudelastic1007` . I've upgraded/downgraded to the... [22:01:51] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10bking) a:05bkingβ†’03None [22:02:54] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... 
[22:10:35] (03PS10) 10Clare Ming: Add Metrics Platform fragments by platform only [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) [22:11:06] (03PS15) 10Clare Ming: Add analytics/metrics_platform/{app,web}/{click,view} schemas [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: 10Phuedx) [22:39:06] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks