[09:16:00] https://www.irccloud.com/pastebin/5z4MHQEE/
[09:16:06] dcausse: ^
[09:26:17] gehel: actually these logs are quite old, would you have something that's around ~20:30 UTC yesterday? (ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet))
[09:26:18] T349069: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers - https://phabricator.wikimedia.org/T349069
[09:34:59] dcausse: looking
[09:38:13] most recent logs on cumin2002 are from June 4. Or I'm not looking in the right place
[09:38:19] cd ..
[09:40:28] found it!
[09:41:05] https://www.irccloud.com/pastebin/i70TOXHy/
[09:46:10] thanks!
[10:02:28] lunch
[12:31:46] gehel: is this the full log? is there anything happening before this?
[12:33:17] There is probably more
[12:33:30] I'll send it in 10'
[12:33:52] thanks!
[12:41:45] dcausse if you want me to put the logs in your homedir on people or something like that LMK
[12:43:53] o/
[12:44:02] inflatador: yes that'd be great! :)
[12:44:43] dcausse ACK, will do. gehel ^^ I can take the log stuff from here
[12:45:14] inflatador: thanks! I'll let you do that!
[12:50:28] dcausse I copied the logs to `wdqs-cookbook-logs` in your homedir on people1004. If you need any more LMK
[12:50:41] thanks!
[12:51:37] np
[12:56:14] inflatador: do you have logs that are more recent (from yesterday)? these ones seem old
[12:58:28] dcausse oops, let me check
[13:01:52] dcausse I'm not sure why, but I'm not seeing anything that matches "wdqs" or "hdfs" that's newer than June 4
[13:02:22] inflatador: on cumin2002?
[13:02:30] dcausse correct
[13:02:56] gehel: found some part of them in https://www.irccloud.com/pastebin/i70TOXHy/ but not sure what the location is
[13:03:31] in ryan's home dir
[13:12:47] in mtg but will pick it up once I'm out
[13:24:38] gehel might be late to 1x1, working w/Ben on helm chart
[13:24:51] inflatador: ack
[13:24:54] dcausse:
[13:24:59] https://www.irccloud.com/pastebin/kZDsuWWm/
[13:25:12] gehel: thanks!
[13:25:51] ah, seems to only log the last command?
[13:28:06] actually I want to see what happened before :)
[13:28:12] the full log would be great :)
[13:32:12] dcausse: full log in your homedir on elastic1090
[13:32:22] thanks!
[13:40:40] gehel I ran out of time...have to take my son to camp ;(
[13:41:15] ack, we can reschedule or see next week
[14:33:40] taking my other son to camp...back in ~30
[14:48:41] ryankemper: made further adjustments to the reload cookbook (https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1038904/23..24/cookbooks/sre/wdqs/data-reload.py), if you get a chance to make another attempt today, could you copy the full logs to e.g. people.eqiad.wmnet:~dcausse/? thanks!
[14:57:32] back
[15:27:38] ebernhardson: fyi, ended up creating https://gitlab.wikimedia.org/repos/search-platform/cirrus-rerender (copying most of the build system from the reindexer), somehow it felt wrong to add all this to cirrus-reindex-orchestrator
[15:28:24] not super happy with the overhead that creates for such a small script, but not sure what other options we had
[15:28:59] dcausse: makes sense, and yea i had pondered shuffling the subdirs so it was more of a `cirrus` module, but then it starts to seem like there should be more of a shared architecture...i dunno, it's probably better to keep them separate for now
[15:37:33] dr0ptp4kt: dcausse: should we schedule some time tomorrow? The RDF router is coming along, and so far I didn't break any of the existing tests. However, I still have to write tests for the routing and resulting patches, which is going to be tedious. Also, ran into a few questions regarding the expected (stub) outcome.
[15:38:31] pfischer: sure, we have the pairing session at 2pm UTC, might work?
[15:48:07] if we can use that slot that works for me. lemme know if I should schedule a different slot though
[16:06:16] i keep saying the elastic percentiles graphs aren't right ... i'm going to change the set of 4 graphs (p99+qps, p95, p75+qps, p50) into three that are the same everywhere (qps, p95, p50)
[16:20:06] thanks!
[16:21:06] inflatador: Do we know what's up with elastic2088 and elastic2099 at the moment? They're down, but I don't know if that's expected.
[16:26:20] btullis no, that is not expected. Let me check
[16:28:23] inflatador: Thanks.
[16:29:45] btullis did you get an alert for these? I don't see anything yet
[16:31:03] I did not get an alert. I only noticed because I was running some cumin commands against 'A:owner-search-platform' and these hosts timed out.
[16:31:09] ryankemper are you running the roll-reboot cookbook for codfw? ^^
[16:31:34] inflatador: no, codfw reboot's been done for a few days
[16:32:12] ryankemper ACK, will reboot both hosts via cookbook and get back
[16:32:25] let's check what row they're in, I think there was some sort of row E/F maintenance this week? I forget the timing though
[16:32:40] nah, that was for eqiad
[16:32:59] https://usercontent.irccloud-cdn.com/file/fvceGbP3/image.png
[16:34:49] would be cool if we had a website or MOTD that listed all the cookbooks in progress
[16:36:42] regardless, looks like the reboot cookbook doesn't work when SSH isn't up
[16:36:49] DRAC time ;)
[16:37:24] how would that help? that said it's easy to make one, we have all the locks, I think I have a draft locally (and I can check them manually)
[16:37:39] inflatador: untrue
[16:38:01] volans see above, I have to check w/ others to make sure there's not a maintenance already going on
[16:38:03] or you mean the rolling reboot?
[16:38:41] inflatador: why? make the cookbook locks in a way that prevents you from running conflicting maintenances
[16:38:44] even better than checking a list
[16:38:48] that is error prone
[16:39:29] The use case is not about running conflicting maintenances, see above
[16:40:08] Cookbook-wise you'll want to use `./cookbooks/sre/hosts/reboot-single.py` btw
[16:40:37] as for the rebooting stuff, I'm running `sudo cookbook sre.hosts.reboot-single elastic2088.codfw.wmnet` and getting `cumin execution failed`
[16:41:03] maybe I'm invoking it wrong? I tried `elastic2088*` and `elastic2088.codfw.wmnet`
[16:41:34] reboot-single tries a clean reboot from the host yes, not via mgmt
[16:42:08] Ah scratch that then
[16:42:50] I don't think we have a simple reboot-via-mgmt cookbook, but we do reboot via mgmt in other cookbooks like reimage
[16:43:15] ACK. maybe a --force option or something? Not a huge deal
[16:44:16] yeah could make sense, feel free to propose a patch
[16:44:48] will do, thanks
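(A minimal sketch of the `--force` idea discussed above: attempt a clean reboot over SSH and, if the host is unreachable, fall back to a power cycle through the management interface. This is a standalone illustration, not the actual sre.hosts.reboot-single cookbook or Spicerack API; the helper names, the `.mgmt.<site>.wmnet` naming assumption, and the use of `ipmitool` with an `IPMI_PASSWORD` environment variable are all assumptions for the sketch.)

```python
#!/usr/bin/env python3
"""Hypothetical sketch: clean reboot over SSH, mgmt power cycle as fallback."""
import argparse
import subprocess
import sys


def clean_reboot(fqdn: str) -> bool:
    """Try an orderly reboot over SSH; return True if the command was accepted."""
    result = subprocess.run(
        ["ssh", "-o", "ConnectTimeout=10", fqdn, "sudo", "reboot"],
        capture_output=True,
    )
    return result.returncode == 0


def mgmt_power_cycle(fqdn: str, user: str = "root") -> None:
    """Power cycle via the DRAC/BMC. Assumes <host>.mgmt.<site>.wmnet naming and
    the IPMI password exported as IPMI_PASSWORD (read by ipmitool -E)."""
    host, site, *_ = fqdn.split(".")
    mgmt_host = f"{host}.mgmt.{site}.wmnet"
    subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", mgmt_host, "-U", user, "-E",
         "chassis", "power", "cycle"],
        check=True,
    )


def main() -> int:
    parser = argparse.ArgumentParser(description="Reboot a host, optionally via mgmt.")
    parser.add_argument("fqdn", help="e.g. elastic2088.codfw.wmnet")
    parser.add_argument("--force", action="store_true",
                        help="fall back to a mgmt power cycle if SSH is down")
    args = parser.parse_args()

    if clean_reboot(args.fqdn):
        print(f"{args.fqdn}: clean reboot issued")
        return 0
    if not args.force:
        print(f"{args.fqdn}: unreachable over SSH; re-run with --force to power cycle via mgmt")
        return 1
    mgmt_power_cycle(args.fqdn)
    print(f"{args.fqdn}: power cycled via management interface")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

(In a real cookbook this would sit behind the existing lock handling and downtime steps rather than raw subprocess calls.)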
[16:50:28] elastic percentiles dashboard updated: https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles
[16:50:39] cooooool
[16:52:57] a bit more consistent now, although i only changed the top percentiles section. Somehow the other 2/3 of this percentiles dashboard isn't percentiles :P
[16:53:51] I think you mean 33.33% of the percentile dashboard IS percentiles ;P
[16:54:39] lol
[16:55:31] 2088 is back up. Still not sure why I didn't get an email or ping in IRC (I did get a ping for it coming back up)
[16:55:46] i did get an instance-not-indexing alert for something, lemme check
[16:56:21] huh, but not for these, weird
[16:56:30] probably for the host we banned for the switch maintenance this morning
[16:56:47] is it because they stopped submitting metrics altogether, instead of submitting 0's? might want to change some alerts
[16:57:08] or maybe not :P i dunno
[16:57:09] they appeared as down in icinga, so not sure yet
[17:01:12] unrelated curious thing. I merged the patch to remove zk network bits from the SUP helmfile, since admin_ng can now provide that functionality. The apply clearly showed changes, but it didn't restart any pods. I guess network changes get applied without restarts?
[17:01:53] * ebernhardson will probably force a restart anyways, to be sure
[17:02:36] Yeah, I think the network policies just manipulate iptables/ipvs
[17:05:46] makes sense
[17:06:26] thanks for doing that regardless, I probably should've done it already
[17:06:52] i was looking through open patches, this has apparently been waiting almost a month now. Seemed like something worth finishing
[17:09:53] thanks!
[17:10:13] lol, promising error message from gitlab ci: fatal: unable to access 'https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater.git/': Could not resolve host: gitlab.wikimedia.org
[17:10:29] :)
[17:11:03] oopsie
[17:12:36] speaking of old patches, I guess we could apply https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/966902 too
[17:12:41] inflatador: ^
[17:13:25] dcausse ACK, I see that one on my gerrit page and then...I keep ignoring it ;(
[17:14:01] np! should be safe to merge, nothing runs there
[17:14:22] I'll merge after I finish this phab ticket, thanks for the reminder
[17:17:28] made me randomly curious about the oldest open patches, as i should probably abandon a bunch of things to make the gerrit generic reports more useful. Apparently the oldest open and not-WIP patch is from Sept 2014. https://gerrit.wikimedia.org/r/q/status:open,7100
[17:43:07] hmm, where might SRE have seen all the requests coming from SUP and noticed they didn't have a user agent? About to deploy the fix to provide http user agents but wondering if i can see it fixed in any dashboard or report
[17:44:45] claime: You noticed on 5-31 we were sending some requests without a user agent, i'm about to ship a fix and wondering where you were seeing them so i can verify the fix?
[17:54:10] dinner
[18:00:04] OK, merged the flink operator patches in dse-k8s
[18:00:24] now I eat! Back in ~40
[18:00:38] err...back in time for pairing in :30
[18:41:48] inflatador: https://phabricator.wikimedia.org/T366363
[19:20:47] user agent update looks to have worked, can see the apache http user agent replaced with our custom agent
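(For context on the user-agent fix above: the actual change lives in the Java/Flink streaming updater, but the general technique is simply setting an explicit User-Agent on the shared HTTP client so upstream services can attribute the traffic, instead of leaving the library default such as Apache-HttpClient/x.y.z. A minimal Python illustration follows; the agent string, contact address, and request target are made up for the example.)

```python
import requests

# Illustrative only: identify the pipeline explicitly rather than relying on
# the HTTP library's default agent string (e.g. "Apache-HttpClient/x.y.z").
# The agent string below is made up; a real one should point at documentation
# and a contact address per the Wikimedia User-Agent policy.
session = requests.Session()
session.headers["User-Agent"] = (
    "cirrus-streaming-updater/1.0 "
    "(https://wikitech.wikimedia.org/wiki/Search; search-platform@example.org)"
)

resp = session.get(
    "https://en.wikipedia.org/w/api.php",
    params={"action": "query", "meta": "siteinfo", "format": "json"},
)
resp.raise_for_status()
# The custom agent is sent on every request made through this session.
print(resp.request.headers["User-Agent"])
```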
[22:01:33] * ebernhardson avoids temptation to write a tiny expression language for blazegraph throttling via headers.
[22:30:18] {◕ ◡ ◕}
[22:43:34] * ebernhardson didn't avoid the temptation :P But i made it way simpler than my first idea