[01:15:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:20:42] (SystemdUnitFailed) resolved: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:35:42] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:37:42] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:09:03] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10MediaWiki-libs-HTTP, 10Beta-Cluster-reproducible, and 3 others: PHP Warning: curl_multi_remove_handle(): supplied resource is not a valid cURL Multi Handle resource - https://phabricator.wikimedia.org/T288624 (10tstarling) During request shu... [05:35:58] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:55:55] * brouberol waves good morning [07:06:13] btullis: the pcc run seems to work for https://gerrit.wikimedia.org/r/c/operations/puppet/+/957918, and we're seeing the new broker being added to the related hosts config. I have a few related questions [07:06:35] good morning brouberol and others :) [07:07:54] - how do you envision deploying the 6 brokers? 
Deploying the first one, then redeploying all related consumers/producers, checking that everything is running smoothly, then redeploying the other brokers? Or maybe redeploying the first broker, then all remaining 5 if everything works well, and then and only then the producers/consumers? [07:10:01] - why do we maintain the list of brokers in producer/consumer configs in the first place? When they come online, the producers/consumers only need to contact one broker to get the topics/partitions placement metadata, and then they contact IP/ports directly, as stored in zookeeper. What we used to do in $PREV_JOB was to only store a roundrobin DNS [07:10:02] that would resolve to a random broker, and that was enough. This allowed us to not redeploy consumers/producers when we added/removed brokers in the cluster [07:13:32] morning joal! [07:20:00] And additional question: do we have a service name resolving to the broker IPs? I tried kafka-jumbo.svc.eqiad.wmnet and kafka-jumbo.eqiad.wmnet, to no avail [08:16:47] Morning team :-) [08:25:42] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:30:08] brouberol: As to your first question, I think it's possibly your call. I think that the first option strikes me as the safest; deploy one broker, restart producers/consumers, check for errors, evaluate whether to do the next lot of brokers together or separately. But that's just my feeling, rather than an instruction. [08:35:03] brouberol: As to your second question, I agree that although we technically only need to provide a single 'seed' broker for it to connect successfully to all of them, that's not currently how the puppet module is set up. 
Ref: https://github.com/wikimedia/operations-puppet/blob/production/modules/confluent/manifests/kafka/broker.pp#L11-L16 [08:38:48] was stat1005 offline? (sorry I have just rejoined this channel and can't see backscroll) [08:40:42] brouberol: Actually, I linked to the wrong file, that's obviously about the broker setup, which only needs to know about the zookeeper and the broker is. For a producer like varnishkafka, you'll see that it currently gets an initial broker list containing all of the brokers. https://github.com/wikimedia/operations-puppet/blob/production/modules/varnishkafka/templates/varnishkafka.conf.erb#L218-L219 [08:41:17] kostajh: I'm not aware of an incident involving stat1005, but I can check for you. Did you experience anything unusual? [08:41:45] btullis: I see an alert here, actually [08:42:34] btullis: yeah, I evaluated a Jupyter cell, went to do some other things, and then couldn't reload the page. I tried dropping and recreating my SSH tunnel. Then I realized I couldn't SSH into stat1005 at all. [08:42:46] I was about to ask if it was offline, when the SSH connection went through [08:43:21] kostajh: Oh yeah, something big happened: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=stat1005&var-datasource=thanos&var-cluster=analytics [08:44:09] btullis: is it possible I caused the problem with the cell I evaluated? [08:47:07] kostajh: I think that's quite likely. I see that your process was definitely killed and it seems to have been using a lot of RAM at the time. [08:47:11] https://www.irccloud.com/pastebin/hxXGdkdQ/ [08:47:40] oops. sorry! [08:48:25] btullis: it might happen again, then, because I tried a modified version of the cell and it is taking >300 seconds so far. [08:48:59] https://www.irccloud.com/pastebin/L5ShCCyn/ [08:49:44] where edit_attempt_blocks has 65,000 rows and st47proxybot_blocks has 180,000 rows. It doesn't seem like this should be problematic? [08:50:07] These things happen. 
:-) I'd say you're probably right. It's climbing past 74% of the system's memory at the moment, according to `htop` [08:50:51] oh dear. [08:50:52] We start killing user processes at 90% https://wikitech.wikimedia.org/wiki/Analytics/Systems/Clients#Resource_management and I suspect that yours will be first again. [08:51:21] can I kill the process or do I have to wait for the system to do it? [08:52:56] Interesting, now it's maxed out the CPUs like in the other graph and htop is no longer responding for me. [08:54:58] We can try to kill it, if you can get any response. your jupyter session has pid `25902` and you have permission to kill it. I can also try, but it depends on whether we can get a signal through at the moment. [08:55:25] btullis: I just ran `kill 25902` [08:55:42] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:56:05] Instantaneous response. System load dropped off a cliff. [08:56:48] btullis: hmm. I am not sure what is problematic about that cell, though. [09:02:19] kostajh: No, I'm afraid I don't either. [09:04:38] btullis: I am going to try a more simplified version of that cell, if that's ok. [09:04:49] I will remove the `.drop()` statements. 
[09:20:15] kostajh: I think the problem comes from you doing a cross-join: you join every row of the left side with every row of the right side - This makes, given the number you gave, 65k*180k ~ 11g rows - that's probably why it breaks memory :) [09:20:45] ah [09:21:49] I killed the process again [09:43:32] !log disable auth_jaas and native login to datahub then enable oidc authentication to production in codfw T305874 [09:43:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:43:36] T305874: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 [09:51:06] !log disable auth_jaas and native login to datahub then enable oidc authentication to production in eqiad T305874 [09:51:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:51:08] T305874: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 [10:42:06] !log deploy datahub in codfw to pick up new changes T305874 [10:42:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:42:09] T305874: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 [10:45:17] !log deploy datahub in eqiad to pick up new changes T305874 [10:45:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:46:08] 10Data-Platform-SRE: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10BTullis) There are some potentially helpful notes and links to reference patches here, from a ticket that I worked on some time ago: {T255148} The nodes in this ticket are the druid 'public' cluster, as opposd... 
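A minimal sketch of the cross-join arithmetic from the 09:20:15 message above. The two row counts come from the log; the tiny sample tables and the `key` column are made up for illustration, and plain Python stands in for whatever dataframe library the notebook used (the log does not say).

```python
from itertools import product

# Row counts quoted in the log for the two tables.
edit_attempt_rows = 65_000
proxy_block_rows = 180_000

# A cross join pairs every left row with every right row.
cross_join_rows = edit_attempt_rows * proxy_block_rows
print(f"{cross_join_rows:,} rows")

# Tiny hypothetical tables to contrast a cross join with a keyed join.
left = [{"key": 1, "a": "x"}, {"key": 2, "a": "y"}, {"key": 3, "a": "z"}]
right = [{"key": 1, "b": "p"}, {"key": 2, "b": "q"}]

# Cross join: len(left) * len(right) output rows.
cross = [{**lrow, **rrow} for lrow, rrow in product(left, right)]

# Keyed join: index the right side once, then look up matches per left row.
right_by_key = {}
for rrow in right:
    right_by_key.setdefault(rrow["key"], []).append(rrow)
keyed = [
    {**lrow, **rrow}
    for lrow in left
    for rrow in right_by_key.get(lrow["key"], [])
]

print(len(cross), len(keyed))
```

With the real row counts the cross join is on the order of 11.7 billion rows, which fits the memory pressure seen on stat1005; a join on a shared key only yields the matching pairs.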
[11:21:01] (03PS5) 10Aqu: Create a job to dump XML/SQL MW history files to HDFS [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) [11:27:26] (03CR) 10CI reject: [V: 04-1] Create a job to dump XML/SQL MW history files to HDFS [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) (owner: 10Aqu) [11:42:16] 10Data-Platform-SRE, 10CAS-SSO, 10Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (10Stevemunene) [11:53:34] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1140.eqiad.wmnet with OS bullseye [12:08:07] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1141.eqiad.wmnet with OS bullseye [12:09:49] When we make changes in `site.pp`, are these changes picked up by a pcc run, or does it use the site.pp from production? [12:10:35] I can't figure out why the PCC runs in https://gerrit.wikimedia.org/r/c/operations/puppet/+/958436/ fail for kafka-jumbo1001 and kafka-jumbo1010, saying that no puppet role is assigned to these nodes. [12:11:26] The regex seem to match: [12:11:26] >>> re.match(r"kafka-jumbo10(0[0-9]|1[0-5])\.eqiad\.", "kafka-jumbo1010.eqiad.wmnet") [12:11:26] [12:11:26] >>> re.match(r"kafka-jumbo10(0[0-9]|1[0-5])\.eqiad\.", "kafka-jumbo1001.eqiad.wmnet") [12:11:26] [12:16:18] brouberol: that's tricky, pcc needs facts for the hosts to be collected in order to work, so it already needs to be up & running [12:17:57] so yeah.... 
it needs the host to be listed on site.pp in production already and the host up & at least one puppet run [12:18:04] so, the node *is* up and running, and I'm trying to assign the role(kafka::jumbo::broker) role to it by changing the regex in site.pp. The regex introduced in the previous change seem to not work, as I'm getting an error on the node saying that no role is assigned to that node [12:18:05] another Jupyter question, I am now seeing `An error occurred while trying to connect to the Java server` when I try to evaluate a cell. Any ideas how to fix this? I am on stat1005 [12:18:21] https://www.irccloud.com/pastebin/jlL8TISE/ [12:18:52] My guess is that the change in that gerrit change would fix it, but I can't get the pcc run to validate. But given what you say, I'm guessing this is a property of this system? [12:19:58] said another way: if puppet is currently failing on a node, any pcc runs in the PR fixing the role assignment won't be able to compile the manifest for this particular node? [12:24:04] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1140.eqiad.wmnet with OS bullseye executed with errors: - an-worker1140 (**FAIL**) - Downtimed on Ic... [12:45:26] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: [NEEDS GROOMING][SPIKE] Can we identify indicators to inform an SLO for event emission and intake? - https://phabricator.wikimedia.org/T345195 (10gmodena) Previous iterations of analytics/data engineering maintained alerts for... [12:47:55] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1141.eqiad.wmnet with OS bullseye completed: - an-worker1141 (**PASS**) - Downtimed on Icinga/Alertm... 
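The regex check pasted at 12:11:26 can be repeated as a small script, as a sketch. The pattern and the two hostnames are copied from the log; `kafka-jumbo1016` is a hypothetical host added only as a negative case, and `has_role_match` is an illustrative helper name. Puppet evaluates node regexes with Ruby rather than Python, so this only approximates what `site.pp` does; note also that `re.match` anchors at the start of the string.

```python
import re

# Pattern copied from the log. Puppet uses Ruby regexes, so Python's `re`
# is an approximation here (assumed equivalent for this simple pattern).
NODE_PATTERN = re.compile(r"kafka-jumbo10(0[0-9]|1[0-5])\.eqiad\.")


def has_role_match(fqdn: str) -> bool:
    """Return True if the FQDN matches the node pattern from its start."""
    return NODE_PATTERN.match(fqdn) is not None


hosts = [
    "kafka-jumbo1001.eqiad.wmnet",  # from the log
    "kafka-jumbo1010.eqiad.wmnet",  # from the log
    "kafka-jumbo1016.eqiad.wmnet",  # hypothetical, outside the 00-15 range
]
for host in hosts:
    print(host, has_role_match(host))
```

Both hosts from the log match, which suggests the pattern itself is not what makes the PCC run report a missing role.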
[12:55:58] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:14:10] > if puppet is currently failing on a node, any pcc runs in the PR fixing the role assignment won't be able to compile the manifest for this particular node? [13:14:41] !log Puppet run successfully on kafka-jumbo1010.eqiad.wmnet. The kafka service is running. T336041 [13:14:44] brouberol: I think that it can compile in the changed branch, but it returns a noop because it can't compile in the production branch then compare a diff of the two. [13:14:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:14:44] T336041: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 [13:14:56] btullis: that makes sense! [13:18:35] 10Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (10bking) ^^ Last message can be ignored; we reimaged wdqs1016 after a role change. Still done! [13:28:09] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10Stevemunene) >>! In T332570#9174460, @ops-monitoring-bot wrote: > Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1140.eqiad.wmnet with OS bullseye executed with errors:... [13:31:12] 10Data-Platform-SRE, 10Data-Catalog: DataHub rights assignment is case-sensitive - https://phabricator.wikimedia.org/T309382 (10Stevemunene) a:03Stevemunene [14:17:33] brouberol: o/ I think that we all forgot one thing, namely the FRtech's kafkatee instance [14:17:44] it doesn't run in production, but in a separate network [14:17:53] (they have their own puppet etc..) 
[14:17:58] all for PCI compliance [14:18:08] and this instance is connected to kafka-jumbo? [14:18:14] it is yes [14:18:23] I see in alerts.w.o: CRITICAL: kafka-jumbo1010:down, kafka-jumbo1010:unconfigured [14:18:32] not a big deal, just create a task for them to upgrade [14:19:08] Should I run the redeploy myself, or should I reach out to someone? [14:20:25] We don't operate on their network stack, you can use this tag in phab https://phabricator.wikimedia.org/project/view/1091/ [14:20:46] not sure what specific needs to be changed on their side, you can mention the alert and that there is a new broker [14:20:52] (and more down the line etc..) [14:29:09] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1142.eqiad.wmnet with OS bullseye [14:44:41] 10Data-Platform-SRE, 10Patch-For-Review: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 (10brouberol) [14:45:18] 10Data-Platform-SRE, 10Patch-For-Review: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 (10brouberol) [14:48:01] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1143.eqiad.wmnet with OS bullseye [14:50:42] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:51:28] I had no idea about the FRtech kafkatee instance myself. 
[14:55:10] 10Data-Engineering, 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Event-Platform, 10Patch-For-Review: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (10bking) [15:00:56] 10Data-Engineering, 10Data-Engineering-Wikistats: Add punjabi language in Wikistats - https://phabricator.wikimedia.org/T344572 (10Milimetric) This is now done [15:10:38] I'm trying to assess what services will need to be restarted to take the new list of kafka jumbo brokers into account. I found a couple of services in `deployment-charts`, but is there a way to get these automatically without grepping in all our repositories? [15:11:27] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1142.eqiad.wmnet with OS bullseye completed: - an-worker1142 (**PASS**) - Downtimed on Icinga/Alertm... [15:12:57] brouberol: you can check https://codesearch.wmcloud.org/search/ for things that are in repositories, but I'm sure there might be other places and for that I defer to the experts ;) [15:13:04] elukey: I don't remember when your anniversary is, but Saturday was 4 years since you were added to the analytics gerrit group, so happy anniversary! [15:13:25] volans: thanks! [15:13:36] full disclosure: doesn't have *all* gerrit repos, but it's an opt-in [15:14:59] milimetric: o/ it is in January, the upcoming one is 8y :O [15:15:46] elukey: Wohoo.
Happy non-anniversary ✨ [15:16:40] 10Data-Platform-SRE, 10Patch-For-Review: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 (10brouberol) [15:19:31] 10Data-Platform-SRE, 10Discovery-Search (Current work): Migrate apifeatureusage hosts to Bullseye or later - https://phabricator.wikimedia.org/T346053 (10EBernhardson) [15:22:57] 10Data-Platform-SRE, 10Patch-For-Review: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 (10brouberol) [15:25:42] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:25:59] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1143.eqiad.wmnet with OS bullseye completed: - an-worker1143 (**PASS**) - Downtimed on Icinga/Alertm... 
[15:27:40] 10Data-Platform-SRE, 10Discovery-Search (Current work): Consider migrating search-loader into Kubernetes - https://phabricator.wikimedia.org/T346189 (10EBernhardson) [15:29:25] 10Data-Engineering, 10serviceops, 10Discovery-Search (Current work), 10Event-Platform: Improve the flink-app chart to provide more useful defaults - https://phabricator.wikimedia.org/T346315 (10EBernhardson) [15:30:09] (03PS1) 10Brouberol: [Gobblin] Add kafka-jumbo1010 to config [analytics/refinery] - 10https://gerrit.wikimedia.org/r/958511 (https://phabricator.wikimedia.org/T336041) [15:32:17] 10Data-Platform-SRE, 10Discovery-Search (Current work): Consider using git-lfs for elastic plugins repo - https://phabricator.wikimedia.org/T344462 (10EBernhardson) [15:37:21] 10Data-Platform-SRE, 10Patch-For-Review: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 (10brouberol) [15:38:00] 10Data-Platform-SRE, 10Patch-For-Review: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 (10brouberol) [15:38:45] !log deploying Superset 2.1.1 to an-tool1005 for superset-next.wikimedia.org [15:38:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:42:08] btullis: I have opened CRs for all the apps I could find that are connected to kafka-jumbo, and added you as a reviewer. For now, we only add kafka-jumbo1010 in config. [15:42:27] varnishkafka was redeployed w/ the new config automatically via puppet [15:42:28] OK, will take a look asap. [15:42:34] Great. [15:42:56] no rush, I have to take care of my daughter for a bit [15:43:35] the next step will be to figure out how to redeploy the apps in k8s with the new config.
I expect I'll need to get setup for k8s/helm work [15:43:38] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1144.eqiad.wmnet with OS bullseye [15:43:55] 10Data-Platform-SRE, 10Discovery-Search (Current work): Migrate search-loader into Kubernetes - https://phabricator.wikimedia.org/T346189 (10EBernhardson) [15:43:57] 10Data-Platform-SRE, 10Discovery-Search (Current work): Migrate search-loader into Kubernetes - https://phabricator.wikimedia.org/T346189 (10bking) [15:44:31] We already saw a change happen with the datahub deployment this morning. It picked up the new kafka-broker automatically and showed us the diff when we applied the other change. All working as expected. [15:54:39] (03PS6) 10Aqu: Create a job to dump XML/SQL MW history files to HDFS [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) [15:56:41] (03PS1) 10Fabian Kaelin: Update knowledge gap metrics into cassandra loading hql [analytics/refinery] - 10https://gerrit.wikimedia.org/r/958518 (https://phabricator.wikimedia.org/T345446) [15:57:24] brouberol: I suspect that these commits to deployment-charts could have been a single change, but it's also fine that they're separate. [16:02:40] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1145.eqiad.wmnet with OS bullseye [16:25:07] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1144.eqiad.wmnet with OS bullseye completed: - an-worker1144 (**PASS**) - Downtimed on Icinga/Alertm... 
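On the 15:10:38 question about finding services that reference the kafka-jumbo brokers without grepping by hand: one rough option, sketched here, is to scan a local checkout for broker hostnames. Everything in this sketch is an assumption for illustration: the `files_mentioning_brokers` helper, the `kafka-jumbo\d{4}` pattern, and the demo directory layout are hypothetical and do not describe the real `deployment-charts` tree. It also only sees what is checked out locally, so it complements rather than replaces codesearch.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical pattern: "kafka-jumbo" followed by a four-digit host number.
BROKER_RE = re.compile(r"kafka-jumbo\d{4}")


def files_mentioning_brokers(root):
    """Map each file under `root` to the sorted broker names it mentions."""
    hits = {}
    root = Path(root)
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        found = sorted(set(BROKER_RE.findall(text)))
        if found:
            hits[path.relative_to(root).as_posix()] = found
    return hits


# Demo on a throwaway directory with a made-up layout.
with tempfile.TemporaryDirectory() as tmp:
    svc = Path(tmp, "helmfile.d", "services", "example-service")
    svc.mkdir(parents=True)
    (svc / "values.yaml").write_text(
        "brokers: kafka-jumbo1001.eqiad.wmnet,kafka-jumbo1002.eqiad.wmnet\n",
        encoding="utf-8",
    )
    (svc / "README").write_text("no brokers here\n", encoding="utf-8")
    demo_hits = files_mentioning_brokers(tmp)
    print(demo_hits)
```

Pointing the helper at a real checkout would list every file that hard-codes a broker name, which gives a starting list of configs to update when brokers are added.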
[16:34:01] (03CR) 10CI reject: [V: 04-1] Create a job to dump XML/SQL MW history files to HDFS [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) (owner: 10Aqu) [16:38:08] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform: [NEEDS GROOMING][SPIKE] Can we identify indicators to inform an SLO for event emission and intake? - https://phabricator.wikimedia.org/T345195 (10lbowmaker) [16:41:34] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1145.eqiad.wmnet with OS bullseye completed: - an-worker1145 (**PASS**) - Downtimed on Icinga/Alertm... [16:44:45] (SystemdUnitCrashLoop) firing: superset.service crashloop on an-tool1005:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [16:56:02] (03PS2) 10Btullis: Update to Superset version 2.2.1 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/957938 (https://phabricator.wikimedia.org/T335356) [17:14:14] 10Data-Engineering, 10Web-Team-Backlog (Needs Prioritization (Tech)): Deal with minified scripts in JS error logging - https://phabricator.wikimedia.org/T520 (10Jdlrobson) [17:23:17] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10SRE, and 2 others: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (10akosiaris) a:05akosiaris→03None [17:28:32] btullis: I can collapse them if required. 
I did that to mirror the spirit in which we roll out the kafka brokers, but I get your point indeed [17:30:08] (03PS1) 10Mforns: Remove null entry from custom_data.[].value enum in monoschema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/958528 (https://phabricator.wikimedia.org/T344235) [17:32:46] I collapsed them all under https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/958496/1 [17:35:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [17:39:01] (03CR) 10CI reject: [V: 04-1] Remove null entry from custom_data.[].value enum in monoschema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/958528 (https://phabricator.wikimedia.org/T344235) (owner: 10Mforns) [17:49:21] 10Data-Engineering: Unable to spawn new environments - https://phabricator.wikimedia.org/T346397 (10Iflorez) [17:51:20] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform: [NEEDS GROOMING][SPIKE] Can we identify indicators to inform an SLO for event emission and intake? 
- https://phabricator.wikimedia.org/T345195 (10gmodena) a:03gmodena [17:52:00] 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform, 10Patch-For-Review: EventGate occasionally fails to ingest specific schemas - https://phabricator.wikimedia.org/T326002 (10gmodena) a:03gmodena [17:57:36] 10Data-Engineering: Unable to spawn new environments - https://phabricator.wikimedia.org/T346397 (10Iflorez) [18:05:29] 10Data-Engineering: Unable to spawn new environments - https://phabricator.wikimedia.org/T346397 (10Iflorez) Thank you @BTullis ` conda-analytics-list ` works and I'm able to view and activate environments. I'll keep this page bookmarked: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Conda#U... [18:30:44] 10Data-Platform-SRE, 10Observability-Alerting, 10observability: Create VictorOps config for new Data Platform SRE team - https://phabricator.wikimedia.org/T344202 (10bking) To help move this forward, I've created a [[ https://wikitech.wikimedia.org/wiki/User:Bking/Notes/DPE_contact_plan | a DPE contact plan... [19:02:14] 10Data-Engineering, 10Data-Engineering-Wikistats, 10Russian-Sites: "Active editors by country" doesn't display numbers for Belarus, Kazakhstan, Russia - https://phabricator.wikimedia.org/T333716 (10FriedrickMILBarbarossa) [19:17:36] 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform, 10Patch-For-Review: EventGate occasionally fails to ingest specific schemas - https://phabricator.wikimedia.org/T326002 (10gmodena) Following up after some investigation. I think we have two pro... 
[19:22:06] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host dbstore1008.eqiad.wmnet with OS bullseye [19:22:09] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host dbstore1009.eqiad.wmnet with OS bullseye [19:25:58] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:45:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [20:03:43] 10Data-Engineering, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Movement-Insights, and 2 others: Unique Devices seasonal trends on small projects - https://phabricator.wikimedia.org/T344381 (10XenoRyet) [20:09:58] 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (10BTullis) >>! In T341792#9102194, @bking wrote: > To check alerting, I removed suppressions and shut off flink-zk1001 via the ganeti... 
[20:27:14] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host dbstore1009.eqiad.wmnet with OS bullseye completed: -... [20:30:38] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host dbstore1008.eqiad.wmnet with OS bullseye completed: -... [20:31:41] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (10Jhancock.wm) [20:32:33] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (10Jhancock.wm) 05Open→03Resolved @Btullis completed [20:45:00] (SystemdUnitCrashLoop) firing: superset.service crashloop on an-tool1005:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [21:11:08] 10Data-Platform-SRE: Service implementation for wdqs101[4,5,6] - https://phabricator.wikimedia.org/T314890 (10RKemper) [21:21:37] 10Data-Platform-SRE, 10Patch-For-Review: Decommission wdqs100[3-5] - https://phabricator.wikimedia.org/T344198 (10RKemper) Running decom cookbook for wdqs100[3,4]. 
Dc-ops ticket up here: https://phabricator.wikimedia.org/T346699 [21:41:12] (03PS3) 10Btullis: Update to Superset version 2.2.1 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/957938 (https://phabricator.wikimedia.org/T335356) [21:49:21] 10Data-Platform-SRE: Decommission wdqs100[3-5] - https://phabricator.wikimedia.org/T344198 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ryankemper@cumin1001 for hosts: `wdqs1003.eqiad.wmnet` - wdqs1003.eqiad.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanager - Found physical host... [21:50:57] 10Data-Platform-SRE: Decommission wdqs100[3-5] - https://phabricator.wikimedia.org/T344198 (10RKemper) [22:09:04] 10Data-Platform-SRE: Decommission wdqs100[3-5] - https://phabricator.wikimedia.org/T344198 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ryankemper@cumin1001 for hosts: `wdqs1004.eqiad.wmnet` - wdqs1004.eqiad.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanager - Found physical host... 
[22:21:53] 10Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (10RKemper) 05Open→03Resolved [22:21:55] 10Data-Platform-SRE, 10Epic: [Epic] Migrate all Search Platform servers to Debian Bullseye - https://phabricator.wikimedia.org/T323921 (10RKemper) [22:28:30] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2): Document data pipeline and data set ownership - https://phabricator.wikimedia.org/T346295 (10Ahoelzl) [22:34:58] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2): Provide overview of common industry DQ methods and systems - https://phabricator.wikimedia.org/T346283 (10Ahoelzl) [22:37:08] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2): Provide overview of best DQ practices and system design - https://phabricator.wikimedia.org/T346283 (10Ahoelzl) [23:13:34] (03PS4) 10Btullis: Update to Superset version 2.2.1 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/957938 (https://phabricator.wikimedia.org/T335356) [23:21:43] (03PS5) 10Btullis: Update to Superset version 2.2.1 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/957938 (https://phabricator.wikimedia.org/T335356) [23:25:58] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:30:42] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:31:19] PROBLEM - Check systemd state on an-tool1005 is CRITICAL: CRITICAL - degraded: The following units failed: superset.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:56:58] (03PS6) 10Btullis: Update to Superset version 2.2.1 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/957938 (https://phabricator.wikimedia.org/T335356) [23:59:07] RECOVERY - Check systemd state on an-tool1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state