[00:14:58] hmm, cloudelastic went from yellow to red but no logged shard failures yet. 18 shards report 'DONE' instead of failed from commonswiki_file/_recovery. Might finally work
[00:15:15] maybe it went red when the first shard became complete, unclear
[00:17:35] i'm curious if tiny chunks or massively restricted bandwidth made the difference, or maybe both. But i'm not sure i'm willing to keep going through this, trying things and failing, just to find out :P
[01:18:48] hmm, it should have 32 shards, but commonswiki_file/_recovery reports 24 shards done, and nothing currently in progress :S
[01:19:38] oh, it's the allocation settings: "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
[01:24:23] meh, not seeing any great answers. docs only refer to creating an empty primary shard via the reroute apis. Guess I'll delete and retry importing with that setting overridden
[01:28:40] how odd, DELETE against commonswiki_file in cloudelastic said {"acknowledged": false}. But then the index is gone. *shrug*
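(For reference, a sketch of the diagnosis/repair calls mentioned above. The shard number and node name are placeholders, and allocate_empty_primary discards whatever data that shard had, which is why delete-and-reimport was the safer route here:)

    # Ask the cluster why a specific primary won't allocate (shard/node are made up):
    curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty' \
      -H 'Content-Type: application/json' \
      -d '{"index": "commonswiki_file", "shard": 24, "primary": true}'

    # The option the docs describe: force an empty primary onto a node, accepting data loss:
    curl -XPOST 'localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '{
      "commands": [{"allocate_empty_primary": {
        "index": "commonswiki_file", "shard": 24,
        "node": "cloudelastic1001", "accept_data_loss": true}}]
    }'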
[13:04:17] greetings
[14:10:37] workout, back in ~30
[14:47:28] \o
[14:47:32] commonswiki_file has returned!
[14:48:00] i better turn some replicas on before it has trouble :P
[14:54:53] also i think i figured out what caused the alerts from cirrus yesterday, the timing coincides with a wider incident: https://docs.google.com/document/d/10bMD6SOiq0jhsoM2k9ZCm9WLy5ag4ko7v7csqTkBQSE/edit#
[14:55:11] i suppose i was looking too narrowly at our own things to notice a bunch of other things temporarily on fire
[14:59:07] I don't have strong proof, but running out of appserver workers and having db issues at the same time, with such close timing, feels like the right explanation
[15:01:40] sounds reasonable
[15:15:07] can't help but be amused: https://pointerpointer.com/
[15:58:51] even my most advanced cursor positioning is no match for ^^ !
[16:44:22] early lunch, then physical therapy....back around 1:30 or so
[16:44:33] sorry, back in 2 hrs or so
[17:16:43] hmm, poking at what causes spark 'Timeout waiting for task'; the known solutions seem not-optimal. They range from 'turn off the shuffle service', which is a significant part of what allows dynamic executor allocation to work, to 'set the timeout really really high'
[17:17:56] i suppose specifically these are timeouts from org.apache.spark.shuffle.FetchFailedException
[18:11:30] meh, ran for 2 hours (instead of ~1 for a successful run) and then failed :( lacking other options, maybe I'll bump the timeouts for jobs with large shuffles. From reading, it seems to be related to how many other things are also pushing on the shuffle service at the same time (and they have a reported fix in spark 3)
[18:12:04] but the ticket for the fix also says increasing timeouts doesn't really resolve the issue, just makes the failures less frequent
[18:12:53] anyways, lunch time. Taco tuesday sounds about right
[18:29:25] ryankemper ebernhardson moved the pairing session to 2 PM PDT / 4 PM CDT since MrG isn't around
[18:30:25] ack, works for me
[19:00:25] back
[19:02:17] * ebernhardson sees someone mention the largest hourly event partition is 4.6G. and yup, that's us :P
[19:03:12] i guess no one else tried to use it like we do, as a record of interactions with another service, recording lots of data about the requests/responses
[19:17:26] ebernhardson re: snapshot restore, if you have any notes feel free to add to https://wikitech.wikimedia.org/wiki/Search/S3_Plugin_Enable , although we should probably rename it to "Backup or Restore". It's also linked on our main page: https://wikitech.wikimedia.org/wiki/Search#Backup_and_Restore
[19:21:19] inflatador: wow, that's already much better documentation than i was probably going to write :) Will update it with extra bits about throttling everything down, and that we don't know what appropriate throttle levels are, but these "worked"
[19:25:56] Awesome! Good to just get something up there, particularly the restore command and a note on how long it took.
[19:28:54] FWIW, looks like Dell does have a Linux-based installer for NIC firmware; I'm testing it out on a second host now. Worked fine on the first
[19:58:14] OK, so the next issue is that the BIOS has the integrated NIC as the second boot option and the 10G NIC as the third. Both are PXE, and it doesn't seem like the PXE environment can recover from a boot failure... it immediately boots off the hard drive after the first NIC fails to boot
[19:58:35] :S
[20:00:24] Dell has management software that can tweak this stuff at the OS level and apply it at next boot, but it's not installed anywhere at WMF :(
[20:57:08] random crazy idea: what if metastore had a per-wiki document containing various cirrus configuration? That document could be read every 10 minutes and cached in-memory on each app server. The problem is then we need tooling to manage configuration that lives somewhere else :P
[22:17:34] looks like the racadm stuff is already documented: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Ceph#BIOS (still gonna add it to the main DC Ops Dell page though)
[22:19:49] https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Changing_BIOS_Boot_Order
[23:33:03] updated the docs for snapshot/restore. Hopefully didn't forget too much
[23:33:05] anyways, calling it a day
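(A sketch of the timeout-bumping option discussed around [17:16:43]-[18:12:04]. The property names are standard Spark settings; the values are illustrative guesses, not tuned recommendations:)

    # spark.network.timeout (default 120s) bounds how long a shuffle fetch may stall;
    # the io retry settings stretch the retry window; maxBlocksInFlightPerAddress
    # (default unlimited) caps how hard each task pushes on a single shuffle service.
    spark-submit \
      --conf spark.network.timeout=600s \
      --conf spark.shuffle.io.maxRetries=10 \
      --conf spark.shuffle.io.retryWait=30s \
      --conf spark.reducer.maxBlocksInFlightPerAddress=64 \
      large_shuffle_job.py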
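(Likewise, the rough shape of the throttled restore from [19:21:19]/[23:33:03]. The repository and snapshot names are placeholders, and the throttle value is an assumption, not the level that actually "worked"; the wikitech page above is the source of truth:)

    # Throttle recovery traffic cluster-wide before kicking off the restore:
    curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' \
      -d '{"transient": {"indices.recovery.max_bytes_per_sec": "40mb"}}'

    # Restore with replicas off, then turn them back on once the primaries are happy:
    curl -XPOST 'localhost:9200/_snapshot/REPO/SNAPSHOT/_restore' -H 'Content-Type: application/json' \
      -d '{"indices": "commonswiki_file", "index_settings": {"index.number_of_replicas": 0}}'
    curl -XPUT 'localhost:9200/commonswiki_file/_settings' -H 'Content-Type: application/json' \
      -d '{"index": {"number_of_replicas": 1}}'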
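(And the racadm boot-order change from [19:58:14]/[22:17:34], as sketched on the linked pages. The NIC FQDD names vary per host, so check what the get returns before setting anything; the sequence below is a made-up example:)

    racadm get BIOS.BiosBootSettings.BootSeq
    racadm set BIOS.BiosBootSettings.BootSeq NIC.Slot.1-1-1,NIC.Integrated.1-1-1,HardDisk.List.1-1
    # Queue a BIOS config job so the change applies on the next power cycle:
    racadm jobqueue create BIOS.Setup.1-1 -r pwrcycle -s TIME_NOW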