[00:07:43] (SystemdUnitFailed) firing: (5) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:10:12] (03CR) 10Cooltey: [C: 03+1] New Event schema for mobile apps [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/960100 (owner: 10Sharvaniharan) [00:12:43] (SystemdUnitFailed) firing: (5) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:18:49] (03CR) 10Shay Nowick: [C: 03+2] New Event schema for mobile apps (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/960100 (owner: 10Sharvaniharan) [00:18:54] RECOVERY - Druid overlord on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [00:19:49] (03Merged) 10jenkins-bot: New Event schema for mobile apps [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/960100 (owner: 10Sharvaniharan) [00:22:43] (SystemdUnitFailed) firing: (5) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:42:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:52:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:12:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:21:43] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10MediaWiki-libs-HTTP, 10Beta-Cluster-reproducible, and 6 others: PHP Warning: curl_multi_remove_handle(): supplied resource is not a valid cURL Multi Handle resource - https://phabricator.wikimedia.org/T288624 (10tstarling) 05Open→03Resolv... [01:22:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:42:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:52:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:07:44] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:12:43] (SystemdUnitFailed) firing: (4) 
druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:22:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:27:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:35:06] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [02:37:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:42:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:45:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage 
- https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [02:52:44] (SystemdUnitFailed) firing: (3) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:12:43] (SystemdUnitFailed) firing: (3) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:22:43] (SystemdUnitFailed) firing: (3) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:37:43] (SystemdUnitFailed) firing: (3) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:42:43] (SystemdUnitFailed) firing: (3) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:52:43] (SystemdUnitFailed) firing: (3) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:12:43] (SystemdUnitFailed) firing: (3) druid-historical.service Failed on druid1009:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:22:43] (SystemdUnitFailed) firing: (3) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:37:43] (SystemdUnitFailed) firing: (3) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:42:43] (SystemdUnitFailed) firing: (3) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:52:44] (SystemdUnitFailed) firing: (3) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:12:43] (SystemdUnitFailed) firing: (3) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:22:43] (SystemdUnitFailed) firing: (3) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:42:43] (SystemdUnitFailed) firing: (3) druid-historical.service Failed on druid1009:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:52:30] 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1bb0fee0-dd0c-4165-86e2-b81abeffa7d2) set by stevemunene@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their services with... [06:00:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [06:17:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:18:58] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:31:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:32:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:35:06] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap 
can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [07:35:28] stevemunene: Heya - it'd be great if you could silence alerts when reimaging or initializing hosts - druid1009 is spamming us badly :) [07:38:01] o/ Apologies for the spam joal, I have already run the sre.hosts.downtime cookbook for druid1009 so we shouldn't expect any more spam [07:38:14] ack stevemunene - thanks for that [08:02:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:03:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:15:35] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:17:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:31:49] (03PS1) 10Gerrit maintenance bot: Add fon.wikipedia to pageview allowlist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/962234 (https://phabricator.wikimedia.org/T347939) [08:32:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:32:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:34:15] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) Now that the incident has been mitigated, I'll resume reassigning partitions away from the kafka-jumbo100[1-6] brokers. ` brouberol@kafka-jumbo1010:~/topicmappr$ topicmappr rebuild --topics '^(-l|-L... [08:40:41] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) ` brouberol@kafka-jumbo1010:~/topicmappr$ kafka reassign-partitions --reassignment-json-file small-or-empty.json --verify kafka-reassign-partitions --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet... [08:45:29] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:47:44] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:57:47] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10EventStreams, 10Event-Platform: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (10elukey) >>! In T347477#9216774, @Ottomata wrote: > WOW thank you Luca! Happy to help! I wanted to understand how to... 
[08:58:05] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) Next batch of reassignment: ` $ topicmappr rebuild --topics '^(edisa.mediawiki.job.xxx|edisa.mediawiki.jobrefreshLinks|elukey_druid_test|ema_test_ats|eqcodfw.mediawiki.revision-create|eqcodfw.rc1.medi... [08:59:27] (03PS1) 10DCausse: rdf_streaming_updater: add emitter_id to side outputs [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/963006 (https://phabricator.wikimedia.org/T347515) [09:30:14] btullis: I'm not 100% sure about the leader imbalance metric, but as you can see here https://grafana-rw.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=thanos&var-kafka_cluster=jumbo-eqiad&var-cluster=kafka_jumbo&var-kafka_broker=All&var-disk_device=All&forceLogin=&from=now-7d&to=now&viewPanel=20 the leadership imbalance between all brokers [09:30:14] is being reduced after each reassignment [09:32:32] \o/ [09:33:28] brouberol: I need to introduce you to ottomata - he's back from paternity leave and is our kafka go-to person (before you were there :) - I'm sure you'll have things to talk about :) [09:34:35] oh for sure! I'll book some time [09:35:00] brouberol: on a different topic - we think we have identified why our flink app was not backfilling to a rate we're happy with, and we managed to overcome the issue by adding more parallelization - The not-yet-understood issue is why it broke in the first place, not managing to write to kafka [09:36:21] indeed. This had me stumped as well. If you want, we could try to set up a second app writing to a topic with a small enough retention so that it is a couple of GB, and we then move it around to try to reproduce [09:36:47] but for now, either we haven't had the epiphany idea, or we don't have a metric on the actual bottleneck [09:37:31] do you have metrics on the size of the batch you're attempting to produce? 
My spidey sense tells me that somehow, this is related to message size (I could 100% be wrong) [09:37:36] let's talk about what we wish to do with the rest of the team before deciding - the trials are not cheap unfortunately and we have many other things to do :) [09:37:54] for sure [09:38:21] I don't have this information in mind (batch size), but the weird thing is that the app started working after the partition-reassignment was done [09:44:42] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 3): Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (10JAllemandou) Ping @BTullis on this - Could we get a prioritization on either a presto version bump or an adaptation of our version... [09:48:57] 10Data-Platform-SRE, 10decommission-hardware: decommission an-test-client1001.eqiad.wmnet - https://phabricator.wikimedia.org/T343520 (10Stevemunene) [09:49:18] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for next deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/962234 (https://phabricator.wikimedia.org/T347939) (owner: 10Gerrit maintenance bot) [09:49:31] 10Data-Platform-SRE, 10decommission-hardware: decommission an-test-client1001.eqiad.wmnet - https://phabricator.wikimedia.org/T343520 (10Stevemunene) 05Open→03Resolved [09:49:39] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10Stevemunene) [09:50:09] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 3): Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (10BTullis) Yes, will do. {T342343} is already in our prioritized backlog, so I'm hoping to get version 0.283 of presto deployed v... 
[09:51:27] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 3): Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (10JAllemandou) >>! In T266641#9219175, @BTullis wrote: > Yes, will do. {T342343} is already in our prioritized backlog, so I'm ho... [09:54:18] fyi I just resumed our flink test job running from dse k8s (reading and writing to kafka-jumber) [09:54:26] s/jumber/jumbo [09:56:57] ack dcausse - does it have a dedicated topic to write its reconciliation to? [09:58:47] joal: kind of, added a quick option to disable producing to these topics (working on a more generic solution in the meantime) [09:59:33] ack dcausse - no problem at all, just trying to keep an up-to-date mental model :) [10:25:29] 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10Stevemunene) [10:35:06] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [10:36:00] ^ I will bump the value in the alert for this check. I increased the heap by 4GB, but haven't modified the threshold in the check yet. [10:38:34] 10Data-Platform-SRE, 10Dumps-Generation, 10cloud-services-team: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (10BTullis) Here's the full list of IP addresses that are being modified. ` (base) btullis@marlin:~/tmp$ cat clouddumps1... 
[11:35:52] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) Next batch rouberol@kafka-jumbo1010:~/topicmappr$ topicmappr rebuild --topics '^(eqiad.android.image_recommendation_event|eqiad.android.image_recommendation_interaction|eqiad.android.install_referrer... [11:38:55] Hi btullis - for people to be able to ssh deployment.eqiad.wmnet, they need to be in the analytics-deploy group, right? [11:40:34] We have Surbhi Gupta now doing ops-week, who appears not to have the right :( (user sg912) [11:40:55] Is this change something we can do quickly, or do you need a formal demand btullis ? [11:44:22] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) Next batch ` brouberol@kafka-jumbo1010:~/topicmappr$ topicmappr rebuild --topics '^(eqiad.change-prop.retry.mediawiki.job.flaggedrevs_CacheUpdate|eqiad.change-prop.retry.mediawiki.job.htmlCacheUpdate... [11:58:15] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) Next batch ` topicmappr rebuild --topics '^(eqiad.changeprop.retry.mediawiki.revision-create|eqiad.changeprop.retry.mediawiki.revision-visibility-change|eqiad.changeprop.retry.resource_change|eqiad.ci... [12:03:35] joal: Oh, sorry. I've just seen your message. [12:04:39] np btullis [12:04:47] so, how shall we do btullis ? [12:05:20] sgupta would have to be in any one of these groups to ssh to a deployment server. https://github.com/wikimedia/operations-puppet/blob/production/hieradata/role/common/deployment_server/kubernetes.yaml#L8-L28 [12:06:25] 10Data-Engineering, 10Infrastructure-Foundations, 10SRE, 10netops, and 2 others: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (10JAllemandou) [12:06:51] Let me check, perhaps analytics-deployers is right for ops week duties, or perhaps analytics-admins. 
[12:07:04] makes sense btullis [12:07:18] I'd go for analytics-deployers, but you know best :) [12:08:24] Also btullis, sgupta should be given the right to send PRs to the puppet repo, in order for her to update the AQS druid snapshot [12:08:41] (sorry to bother about this btullis :S - Once done, it'll be over for this person) [12:08:50] I think it's analytics_admins because members of that group are then made members of analytics-deployers here: https://github.com/wikimedia/operations-puppet/blob/production/modules/admin/data/data.yaml#L897-L900 [12:09:01] It's no bother. I'll sort it now. [12:09:08] thanks a million :) [12:11:06] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) Next batch ` topicmappr rebuild --topics '^(eqiad.eventgate-main.test.event|eqiad.eventlogging_ContentTranslationAbuseFilter|eqiad.eventlogging_EditAttemptStep|eqiad.ios.edit_history_compare|eqiad.io... [12:14:41] joal: Would you mind adding a comment to here please, stating the requirement re ops week? https://phabricator.wikimedia.org/T335657 [12:15:18] I've got rights to approve the group membership, but it would be good to have some record of the request. [12:16:32] btullis: https://phabricator.wikimedia.org/T335657#9219848 [12:22:21] Puppet patch ready. https://gerrit.wikimedia.org/r/c/operations/puppet/+/963022 [12:25:57] 10Data-Platform-SRE, 10Cloud-VPS, 10cloud-services-team, 10User-aborrero: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10taavi) Thanks! `lang=shell-session taavi@cloudcontrol1006 ~ $ os server list --long --all-projects --host cloudvirt-wdqs1001 +---------------------------------... 
[12:28:15] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, 10Patch-For-Review: mw-page-content-change-enrich should not retry on badrevids if no replica lag - https://phabricator.wikimedia.org/T347884 (10CodeReviewBot) otto merged https://gitlab.wikimedia.org/repos/data-en... [12:30:44] btullis: sorry to bother again - can you add sg912 to the puppet gerrit repo? [12:32:10] I'm doing it as we speak. I believe that it is the wmf LDAP group that I mentioned here: https://phabricator.wikimedia.org/T335657#9186606 - I thought someone else picked it up after I mentioned it, but it seems not. Just doing double-checks. [12:34:36] <3 [12:37:29] `wmf` LDAP group done (that should also allow Grafana login/edit, which was also requested) - puppet patch merged, Puppet running on deploy2002.codfw.wmnet now. [12:38:54] 10Data-Platform-SRE: Monitor kafka topics with a replication factor of 1 - https://phabricator.wikimedia.org/T346887 (10brouberol) @Ottomata can you confirm that I can delete these topics? They all have RF=1: ` brouberol@kafka-jumbo1010:~/topicmappr$ kafka topics --describe | grep 'ReplicationFactor:1' | grep -... [12:40:41] https://www.irccloud.com/pastebin/UQjjPRqT/ [12:41:22] joal: Please feel free to try the deploy again and log out/in of gerrit to pick up the new privileges. [12:41:38] sure [12:42:39] 10Data-Platform-SRE, 10Patch-For-Review: Reassign partitions away from kafka-jumbo100[1-6] to kafka-jumbo10[07-15] brokers - https://phabricator.wikimedia.org/T346425 (10brouberol) [12:43:40] 10Data-Platform-SRE: Install kafka-kit binaries on kafka brokers - https://phabricator.wikimedia.org/T346764 (10brouberol) 05Open→03Resolved [12:43:43] 10Data-Platform-SRE, 10Patch-For-Review: Reassign partitions away from kafka-jumbo100[1-6] to kafka-jumbo10[07-15] brokers - https://phabricator.wikimedia.org/T346425 (10brouberol) [12:43:48] 10Data-Platform-SRE: Package kafka-kit binaries (topicmappr, metricsfetcher, ...) 
as a debian-package - https://phabricator.wikimedia.org/T346763 (10brouberol) 05Open→03Resolved [12:43:52] 10Data-Platform-SRE, 10Patch-For-Review: Reassign partitions away from kafka-jumbo100[1-6] to kafka-jumbo10[07-15] brokers - https://phabricator.wikimedia.org/T346425 (10brouberol) [12:48:00] (SystemdUnitFailed) firing: wikidatardf-truthy-dumps.service Failed on snapshot1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:02:01] 10Data-Engineering, 10Data-Platform-SRE, 10Epic: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710 (10Gehel) [13:03:33] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: WDQS graph split: load data from dumps into new hosts - https://phabricator.wikimedia.org/T347504 (10dcausse) @bking only one host has to be loaded with the full dataset. The loading process can be started as soon as possible but there are few constr... [13:10:56] 10Data-Engineering, 10EventStreams, 10Event-Platform: Make eventgate-analytics-external the default event service - https://phabricator.wikimedia.org/T342610 (10phuedx) >>! In T342610#9216725, @Ottomata wrote: > Is this just so it doesn't have to be set manually for analytics devs creating streams? Yes. >... [13:11:49] btullis (and others): I wrote a description for our Superset goal (https://docs.google.com/document/d/1dfz-aeKFRMlYJEtzhNjoIruxMNsnlMbAE4NNe4u8ZIw/edit). If you could review and comment... [13:12:19] gehel: Great, thanks. Will do. 
[13:18:11] 10Data-Platform-SRE, 10Discovery-Search: Investigate recent CirrusSearch p95 latency - https://phabricator.wikimedia.org/T347988 (10bking) [13:18:46] 10Data-Platform-SRE, 10Discovery-Search: Investigate recent CirrusSearch p95 latency - https://phabricator.wikimedia.org/T347988 (10bking) [13:19:43] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Document process for getting JNL files/consider automation - https://phabricator.wikimedia.org/T347605 (10dr0ptp4kt) For me the first 300 GB of the file went really, really fast. But `axel` was dropping connections, similar to when I had downloaded t... [13:23:28] 10Data-Platform-SRE, 10Cloud-VPS, 10cloud-services-team, 10User-aborrero: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by taavi@cumin1001 for hosts: `cloudvirt-wdqs1002.eqiad.wmnet` - cloudvirt-wdqs1002.eqiad.wmnet (*... [13:32:44] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:33:13] 10Data-Platform-SRE: Monitor kafka topics with a replication factor of 1 - https://phabricator.wikimedia.org/T346887 (10brouberol) And more broadly, are you using any of these topics? ` brouberol@kafka-jumbo1010:~/topicmappr$ kafka topics --describe | grep -i otto | awk '{ print $1 }' | sed 's/Topic://' | grep -... 
[13:33:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:34:12] 10Data-Engineering, 10EventStreams, 10Event-Platform: Make eventgate-analytics-external the default event service - https://phabricator.wikimedia.org/T342610 (10Ottomata) > Does anything fail if a developer tries to create a stream without destination_event_service set? Nothing explicitly, but there won't b... [13:34:19] 10Data-Platform-SRE, 10Cloud-VPS, 10cloud-services-team, 10User-aborrero: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by taavi@cumin1001 for hosts: `cloudvirt-wdqs1003.eqiad.wmnet` - cloudvirt-wdqs1003.eqiad.wmnet (*... [13:36:52] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10SRE-Access-Requests, 10Event-Platform: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (10Ottomata) [13:37:17] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10SRE-Access-Requests, 10Event-Platform: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (10Ottomata) Updated description and tagged #sre-access-requests [13:38:41] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) Next reassignment will only span `eqiad.mediawiki.api-request` as it's a large topic. ` brouberol@kafka-jumbo1010:~/topicmappr$ topicmappr rebuild --topics 'eqiad.mediawiki.api-request' --brokers 100... 
[13:39:52] 10Data-Engineering, 10Event-Platform, 10MW-1.41-notes (1.41.0-wmf.29; 2023-10-03), 10Platform Team Initiatives (New Hook System): Update EventBus to use the new HookContainer/HookRunner system - https://phabricator.wikimedia.org/T346539 (10Ottomata) @Umherirrender is CentralNotice part of this task? If no... [13:43:40] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye [13:46:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:46:42] 10Data-Engineering, 10EventStreams, 10Event-Platform: Make eventgate-analytics-external the default event service - https://phabricator.wikimedia.org/T342610 (10phuedx) 05Open→03Resolved a:03phuedx Being **bold**. In this case it's better to be explicit. Maybe we could write a test to check that the p... [13:47:44] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:48:06] 10Data-Platform-SRE, 10Dumps-Generation, 10cloud-services-team: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (10BTullis) @jbond - Could I ask you to cast your eye over this (T346165#9219409) please, if you have a little time. We... 
[13:48:42] 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Epic: [EPIC] Deployment of the Search Update Pipeline on Flink / k8s - https://phabricator.wikimedia.org/T340548 (10bking) [13:49:17] btullis is there a way to proactively mute an alert in alertmanager? My understanding is that if the alert does not yet fire, it's not visible in alertmanager. [13:49:49] 10Data-Platform-SRE, 10Cloud-VPS, 10cloud-services-team, 10User-aborrero: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by taavi@cumin1001 for hosts: `cloudvirt-wdqs1001.eqiad.wmnet` - cloudvirt-wdqs1001.eqiad.wmnet (*... [13:51:17] brouberol: I think you can do it with label matching from the top-right corner https://usercontent.irccloud-cdn.com/file/XLXDMEyC/image.png [13:51:42] Or from the CLI: https://wikitech.wikimedia.org/wiki/Alertmanager#Add_a_silence_via_CLI [13:51:58] ...but I confess I've not done either much. [13:52:57] (03PS14) 10Milimetric: Create a job to dump XML/SQL MW history files to HDFS [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) (owner: 10Aqu) [13:53:23] Thanks! Using the first method I was able to browse previous expired silences and re-create one [13:55:07] (03CR) 10Milimetric: "ooh, 14's my lucky number. This is being merged with a couple of caveats:" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) (owner: 10Aqu) [13:55:08] 10Data-Engineering, 10Event-Platform, 10MW-1.41-notes (1.41.0-wmf.29; 2023-10-03), 10Platform Team Initiatives (New Hook System): Update EventBus to use the new HookContainer/HookRunner system - https://phabricator.wikimedia.org/T346539 (10Umherirrender) a:03Umherirrender >>! In T346539#9220207, @Ottomat... 
[13:55:21] (03CR) 10Milimetric: [C: 03+2] Create a job to dump XML/SQL MW history files to HDFS [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) (owner: 10Aqu) [13:55:26] 10Data-Platform-SRE, 10Dumps-Generation, 10cloud-services-team, 10Patch-For-Review: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (10jbond) @BTullis from a quick look the resources that is changing is the ferm::service resource... [13:58:07] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: mw-page-content-change-enrich should not retry on badrevids if no replica lag - https://phabricator.wikimedia.org/T347884 (10Ottomata) Wow, before deploying this, we were backfilling with 20 replicas and doing ~100 m... [13:59:50] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye [14:05:32] (03Merged) 10jenkins-bot: Create a job to dump XML/SQL MW history files to HDFS [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) (owner: 10Aqu) [14:15:55] 10Data-Platform-SRE: Monitor kafka topics with a replication factor of 1 - https://phabricator.wikimedia.org/T346887 (10Ottomata) Please delete all of those! Thanks. [14:16:27] 10Data-Platform-SRE: Monitor kafka topics with a replication factor of 1 - https://phabricator.wikimedia.org/T346887 (10Ottomata) You can also probably delete ANY topic that has ksql in it. We've never used KSQL in prod. 
[14:31:08] 10Data-Engineering, 10EventStreams, 10Event-Platform: Make eventgate-analytics-external the default event service - https://phabricator.wikimedia.org/T342610 (10Ottomata) > Maybe we could write a test to check that the property is present on all streams? It doesn't need to be present on all streams, since n... [14:35:06] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [14:37:01] 10Data-Platform-SRE, 10Dumps-Generation, 10cloud-services-team, 10Patch-For-Review: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (10BTullis) >>! In T346165#9220435, @jbond wrote: > @BTullis from a quick look the resources that... [14:38:32] 10Data-Engineering, 10Event-Platform, 10MW-1.41-notes (1.41.0-wmf.29; 2023-10-03), 10Patch-For-Review, 10Platform Team Initiatives (New Hook System): Update EventBus to use the new HookContainer/HookRunner system - https://phabricator.wikimedia.org/T346539 (10Ottomata) Oh right! I had forgotten those wh... [14:42:46] 10Data-Engineering, 10EventStreams, 10Event-Platform: Make eventgate-analytics-external the default event service - https://phabricator.wikimedia.org/T342610 (10phuedx) TIL [14:46:47] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform, 10Patch-For-Review: Enable canary events for all streams - https://phabricator.wikimedia.org/T266798 (10Ottomata) ^ oops, wrong task. 
[15:12:59] 10Data-Platform-SRE, 10Dumps-Generation, 10cloud-services-team, 10Patch-For-Review: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (10BBlack) We could add some normalization function at the ferm or puppet-dns-lookup layer perhaps... [15:20:03] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed wi... [15:33:45] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Event-Platform: Partition reassignment on kafka-jumbo negatively impacting mw-page-content-change-enrich - https://phabricator.wikimedia.org/T347676 (10Ottomata) [15:37:47] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed wi... [15:43:17] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: [TEMPLATE] Onboard request for APPLICATION NAME to Event Platform - https://phabricator.wikimedia.org/T346207 (10Ottomata) Can this be declined? Or is this the actual Template? [15:46:30] 10Data-Engineering, 10EventStreams, 10Event-Platform: Event streams don't respect milliseconds UTC unix epoch timestamp in since parameter - https://phabricator.wikimedia.org/T345606 (10Ottomata) Thanks for the report! Getting it into triage... 
[15:49:52] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye [15:49:56] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye [15:57:43] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Five deleted Wikidata items pertaining to Wikimedia category pages still present in the Query Service - https://phabricator.wikimedia.org/T342593 (10Ottomata) > I agree with @Milimetric here... [16:03:11] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: Rename and release stream as mediawiki.page_content_change.v1 - https://phabricator.wikimedia.org/T341783 (10Ottomata) 05Open→03Resolved a:03Ottomata I believe this is already done. Resolving. [16:05:00] 10Data-Engineering, 10Event-Platform: mediawiki-event-enrichment deployment process should include producing an event in staging and verifying success - https://phabricator.wikimedia.org/T341138 (10Ottomata) This is a great idea. EventGate does this for its readiness probe. We can do this in prod too, but pr... [16:05:49] 10Data-Engineering, 10Event-Platform: mediawiki-event-enrichment deployment process should include producing an event in staging and verifying success - https://phabricator.wikimedia.org/T341138 (10Ottomata) Oh wait this filed by past me. Great idea past me: doh. 
[16:07:04] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: eventutilities-python EventProcessFunction throws NPE if user func returns None - https://phabricator.wikimedia.org/T335706 (10Ottomata) 05Open→03Resolved [16:12:09] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: flink-app: swift bucket and zookeeper paths should be templated. - https://phabricator.wikimedia.org/T336901 (10Ottomata) Related: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/959059 [16:15:25] 10Data-Platform-SRE, 10Dumps-Generation, 10cloud-services-team, 10Patch-For-Review: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (10jbond) @BTullis if its the zero padding we could be hitting a bug in [[ https://github.com/ruby... [16:17:22] 10Data-Engineering, 10Anti-Harassment, 10Data Engineering and Event Platform Team, 10Privacy Engineering, and 4 others: Exposing revIDs (nothing more) of deleted/suppressed edits for research to respect their removal - https://phabricator.wikimedia.org/T200559 (10Ottomata) 05Open→03Resolved a:03Ottoma... 
[16:19:14] 10Data-Engineering, 10Machine-Learning-Team, 10Wikimedia Enterprise, 10Data Engineering and Event Platform Team (Sprint 3), and 2 others: Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (10Ottomata) [16:19:16] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Machine-Learning-Team, 10Research, 10Event-Platform: Proposal: Create a stream end point for Revision Risk Model - https://phabricator.wikimedia.org/T326179 (10Ottomata) [16:20:08] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10MediaWiki-extensions-WikimediaEvents, 10Product-Analytics, and 3 others: Decommission the EditorActivation instrument - https://phabricator.wikimedia.org/T330766 (10Ottomata) How we doin here? :) [16:21:18] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10MediaWiki-Vagrant, 10Event-Platform: EventBus should not blackhole undeclared streams - https://phabricator.wikimedia.org/T329480 (10Ottomata) 05Open→03Invalid [16:22:39] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: Automated event stream throughput alerting for important state change streams - https://phabricator.wikimedia.org/T329070 (10Ottomata) Related: {T345195} [16:23:52] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [16:29:56] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: [EPIC] Streaming and event driven Python services - https://phabricator.wikimedia.org/T324689 (10Ottomata) [16:30:02] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: [NEEDS GROOMING] Integrate Flink Table API in eventutils-python - https://phabricator.wikimedia.org/T324953 (10Ottomata) 05Open→03Declined I'm going to be bold and decline this one. If/when we decide to really really suppor... [16:30:42] 10Analytics-Radar, 10Data-Engineering, 10Data Engineering and Event Platform Team, 10SRE, and 2 others: Configuration Management for Kafka settings - https://phabricator.wikimedia.org/T276088 (10Ottomata) [16:31:54] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: Refactor EventBus extension Hooks to use new hook system - https://phabricator.wikimedia.org/T320655 (10Ottomata) [16:32:03] 10Data-Engineering, 10Event-Platform, 10MW-1.41-notes (1.41.0-wmf.29; 2023-10-03), 10Patch-For-Review, 10Platform Team Initiatives (New Hook System): Update EventBus to use the new HookContainer/HookRunner system - https://phabricator.wikimedia.org/T346539 (10Ottomata) [16:32:28] 10Data-Engineering, 10Data Engineering and Event Platform Team: Drop GuidedTour* tables - https://phabricator.wikimedia.org/T317460 (10Ottomata) [16:34:34] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: Add schema diffing support to jsonschema-tools and run diff in CI - https://phabricator.wikimedia.org/T321850 (10Ottomata) An alternative to do this would be to also materialize another file with a static name that has the full... 
[17:02:44] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:04:32] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w... [17:05:06] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:02] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed w... 
[17:15:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:17:44] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:18:29] 10Data-Engineering, 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (10bking) [17:18:36] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: Migrate the WDQS streaming updater from FlinkKafkaConsumer/Producer to KafkaSource/Sink - https://phabricator.wikimedia.org/T326914 (10bking) [17:19:22] 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Epic: [EPIC] Deployment of the Search Update Pipeline on Flink / k8s - https://phabricator.wikimedia.org/T340548 (10bking) [17:19:33] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: Migrate the WDQS streaming updater from FlinkKafkaConsumer/Producer to KafkaSource/Sink - https://phabricator.wikimedia.org/T326914 (10bking) [17:29:02] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Document process for getting JNL files/consider automation - https://phabricator.wikimedia.org/T347605 (10dr0ptp4kt) Here's the `sha1sum` for the latest file I had downloaded: ` /mnt/x$ time sha1sum wikidata.jnl.zst 62327feb2c6ad5b352b5abfe9f0a4d3cc... 
[17:33:30] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye [17:33:34] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye [17:34:01] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w... [17:34:39] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye [17:35:04] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w... [17:49:29] 10Data-Engineering, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): 2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF org.openrdf.rio.RDFParseException: Expected '.', found 'g'` - https://phabricator.wikimedia.org/T347647 (10dr0ptp4kt) I did manage to run a `sha1sum`... 
[17:55:01] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10MediaWiki-extensions-WikimediaEvents, 10Product-Analytics, and 3 others: Decommission the EditorActivation instrument - https://phabricator.wikimedia.org/T330766 (10phuedx) [17:55:18] 10Analytics-Kanban, 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Fundraising-Backlog, and 3 others: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned - https://phabricator.wikimedia.org/T282131 (10phuedx) [17:55:36] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10MediaWiki-extensions-WikimediaEvents, 10Product-Analytics, and 3 others: Decommission the EditorActivation instrument - https://phabricator.wikimedia.org/T330766 (10phuedx) 05Open→03Resolved a:03phuedx Being **bold**. [18:28:48] 10Data-Engineering, 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (10bking) Moving to blocked, as we cannot start the test service until T326914 is resolved. [18:35:06] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [18:40:23] 10Data-Engineering, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): 2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF org.openrdf.rio.RDFParseException: Expected '.', found 'g'` - https://phabricator.wikimedia.org/T347647 (10dr0ptp4kt) 05Open→03Resolved I'm going... 
[18:48:23] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed w... [19:11:39] 10Analytics, 10Data-Engineering-Radar, 10Data Engineering and Event Platform Team, 10Product-Analytics, 10Event-Platform: [MEP] Determine how stream configuration is authored and deployed - https://phabricator.wikimedia.org/T269774 (10Ottomata) 05Open→03Declined being bold and declining, cc @phuedx [19:13:26] 10Analytics-Radar, 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: Duplicated revision_create events - https://phabricator.wikimedia.org/T262203 (10Ottomata) @Milimetric could these also be null edits? [19:14:08] 10Analytics-Kanban, 10Data-Engineering, 10Data Engineering and Event Platform Team, 10MediaWiki-extensions-EventLogging, and 3 others: Modern Event Platform - https://phabricator.wikimedia.org/T185233 (10Ottomata) [19:14:12] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Metrics Platform Backlog, 10Event-Platform: Document in-schema who sets which fields - https://phabricator.wikimedia.org/T253392 (10Ottomata) 05Open→03Declined Being bold and declining, feel free to reopen. 
[19:15:31] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye [19:15:35] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye [19:15:55] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed w... [19:15:57] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w... 
[19:16:37] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye [19:16:40] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye [19:24:11] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: mw-page-content-change-enrich should not retry on badrevids if no replica lag - https://phabricator.wikimedia.org/T347884 (10Ottomata) a:03Ottomata [19:38:07] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w... [19:41:54] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye [20:24:07] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [20:49:45] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed w... [20:56:39] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w... [21:14:31] 10Data-Platform-SRE, 10Discovery-Search: Investigate recent CirrusSearch p95 latency - https://phabricator.wikimedia.org/T347988 (10bking) [21:18:00] (SystemdUnitFailed) firing: wikidatardf-truthy-dumps.service Failed on snapshot1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:00:34] 10Data-Platform-SRE, 10Discovery-Search: Investigate recent CirrusSearch p95 latency - https://phabricator.wikimedia.org/T347988 (10bking) [22:03:26] 10Data-Platform-SRE, 10Discovery-Search: Investigate recent CirrusSearch p95 latency - https://phabricator.wikimedia.org/T347988 (10bking) Looked at this today with @Gehel and @RKemper at SRE pairing. We focused on the specific time frame of 2023-10-03 02:00:00 to 2023-10-03 09:00:00 . High latency affected b... [22:35:06] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [23:23:42] 10Quarry, 10superset.wmcloud.org, 10cloud-services-team (FY2023/2024-Q1): Replace Quarry with an installation of Superset - https://phabricator.wikimedia.org/T169452 (10Audiodude) So is it correct that we're looking for a new maintainer, but only in the capacity of migrating all usage of Quarry to Superset?...