[09:03:22] FYI, going to reboot the host running grafana.wikimedia.org in ~5 minutes, let me know if it's a bad time and I'll postpone
[10:09:58] there are quite a lot of active "CRITICAL - degraded: The following units failed:" alerts in Icinga, some for multiple days or weeks now...
[16:12:12] cdanis (and team): congrats on the official launch of wikimediastatus \o/
[16:12:46] legoktm: <3 thanks!
[16:38:38] cdanis: is there a phab project? I have a bug to report :)
[16:39:00] legoktm: there's not a tag but feel free to make a subtask of T202061
[16:39:00] T202061: Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061
[16:42:01] {{done}} - T305174
[16:42:02] T305174: Advertised RSS/Atom feeds for wikimediastatus.net don't work - https://phabricator.wikimedia.org/T305174
[16:43:41] also, people elsewhere would like to see what the page looks like when there's an active incident, do you have a screenshot offhand?
[16:44:05] hmm
[16:44:25] I have a few snaps of the graphs going wonky, not sure if I have one of the text up top changing
[16:44:36] one minute
[16:44:56] <_joe_> yeah, the only time since launch it would have been needed, the issues were over before we could post
[16:46:45] legoktm: I don't have one for our page specifically, however, what is currently happening on https://metastatuspage.com/ gives you some idea 😅
[16:47:11] haha
[16:48:44] legoktm: in terms of the metrics, here's a sample outage (the ulsfo connectivity issue from weeks ago where multiple 'redundant' transit providers failed simultaneously) https://i.imgur.com/6hFRZtm.png
[16:49:13] (I think so, anyway, I would have to dig to be sure)
[16:49:26] oh actually no, this is a different outage
[16:50:14] (the page doesn't turn black when the site is down, although that would be really cool)
[16:50:43] about to reboot some lvses around the fleet, starting with all the backup lvses
[16:50:47] <_joe_> rzl: i'm still curious about how much traffic they'll get if we're down
[16:51:00] please no service catalog config changes for a while without coordination!
[16:51:35] <_joe_> we should have a way to tell CI to refuse to merge changes that touch it during these operations
[16:52:04] or work on the project that gets rid of it! :)
[16:55:54] shout out to volans for the sre.hosts.reboot-single cookbook, it makes this much easier! :)
[16:56:37] err, I guess moritz actually started that one, I was just glancing at the top of the git log at first!
[16:58:44] ehehe yeah on that specific one I didn't do too much, but we also have much better abstractions for cluster-wide roll restarts, that was meant for very small clusters / one-off hosts :D
[16:59:41] yeah, in this case I'm using it for all the lvs and dns hosts, and just wrapping my own little for loops to control ordering and overlap
[17:00:08] but having the downtimes and icinga polling, etc. built in is super nice
[17:05:59] congrats on wikimediastatus, cdanis (and all involved!)
[17:21:14] noob debian packaging question: when I run `gbp buildpackage` with `--git-dist=bullseye`, I still get `Distribution: buster-wikimedia` in the .changes file
[17:21:30] is that because I have buster-wikimedia in `debian/changelog`? and if so what's the right way to build for both?
[17:21:55] the way I have been told is you make a dummy commit that edits debian/changelog, generally with a different version
[17:22:15] e.g. +wmf11u1
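The `Distribution:` value in the .changes file is taken from the top entry of debian/changelog, which is why `--git-dist` alone doesn't change it; that flag only selects the build chroot. A minimal sketch of the throwaway changelog bump being described here (package name, version, and maintainer are made up for illustration):

    $ dch -v 1.2.3-1+wmf11u1 -D bullseye-wikimedia --force-distribution "Rebuild for bullseye"
    $ head -n 5 debian/changelog
    foo (1.2.3-1+wmf11u1) bullseye-wikimedia; urgency=medium

      * Rebuild for bullseye

     -- Your Name <you@example.org>  Thu, 07 Apr 2022 17:22:00 +0000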
[17:22:36] rzl: I do dch -i, update the version at the top of the changelog, and set the distribution to bullseye-wikimedia
[17:23:37] cool, okay -- the next time I do a release, if I need to build for both, do I change it back and forth again?
[17:24:00] (or is the idea that I make this commit on deneb but don't bother pushing it to gerrit?)
[17:24:00] rzl: don't quote me on this, but that's what I did for a bullseye backport and it ?works?
[17:24:26] rad, thanks both
[17:25:02] rzl: I generally did the latter
[17:25:25] if you wind up in a sad situation where you need to maintain e.g. different patches for different Debian versions, then you start doing branched development on each and push it back
[17:25:37] but if you just need to change the one line in debian/changelog, then you can just make a throwaway local commit
[17:25:43] 👍
[17:27:25] rzl: you can see/copy from /home/volans/spicerack-release for example
[17:27:37] _joe_: If I understand correctly, requestctl is not only for throttling/blocking but also for controlling the "list of pooled ats backends to connect to". I'm trying to find in the code or docs where it does this. Or maybe that's not yet implemented?
[17:27:43] that does the build for multiple distros
[17:28:05] (I just recently removed buster, but you can check the similar scripts for wmflib or cumin)
[17:28:33] volans: ah cheers, I was cribbing from the spicerack repo but I was missing this part :)
[17:28:47] that's on the build host
[17:29:14] Krinkle: I don't believe it's intended to change that, no
[17:29:16] yeah reading it now, thanks
[17:29:23] Krinkle: although confd is used for that as well, just differently
[17:29:31] ping me if you have doubts Reuven
[17:32:20] cdanis: right, we have the pybal lists managed there I believe. The quote is from the requestctl README
[17:35:23] Krinkle: the readme means that just like we used etcd to configure varnish with the pooled ats backends, we now also configure it with cloud IP ranges and ban rules
[17:35:30] "requestctl is a conftool extension that allows to easily manage the latter two types of data"
[17:36:16] so to manage the first type of data, you wouldn't use requestctl, but continue to use confctl as ever
[17:38:15] I see. That makes sense in retrospect. It feels slightly odd to lead with a (complete?) list of things Varnish loads from etcd specifically here in this readme. I guess it's useful context to understand the tool's limitations (e.g. we can't operate on stuff not in that list and/or would have to add it there).
[17:38:16] Thanks
[18:24:17] hmmm, back on the topic of reboot-single
[18:24:21] (the cookbook)
[18:24:45] I was assuming failures of icinga recovery would fail the script, but they don't. not sure if that's intentional.
[18:24:52] it's probably debatable in any case
[18:24:55] The host's status is not optimal according to Icinga, please check it.
[18:24:56] Deleted silence ID 14948fc3-c142-4108-90f1-f142794f4cc2
[18:24:57] END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs2009.codfw.wmnet
[18:25:14] ^ was the last 3 lines of a run
[18:28:38] but it doesn't repool it
[18:29:05] and it polls for a while for optimal icinga state right?
[18:29:33] we could argue that it should either stop and ask for confirmation before continuing OR not remove the downtime I guess
[18:29:38] thoughts?
[18:30:48] yeah in this case there's no pooling or repooling to do
[18:31:12] it's more about the PASS + exit=0. Had it been one of my earlier for loops over it, it would've been pretty ugly
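A rough sketch of the kind of wrapper loop mentioned above, to show why exit_code=0 on a non-optimal Icinga state matters (hostnames and the exact cookbook invocation are illustrative, not taken from the run quoted earlier):

    # stop at the first host whose reboot cookbook reports failure
    for host in lvs2009.codfw.wmnet lvs2010.codfw.wmnet; do
        sudo cookbook sre.hosts.reboot-single "$host" || {
            echo "reboot-single failed on ${host}, stopping"
            break
        }
        # if the cookbook exits 0 despite "host's status is not optimal",
        # the guard above never fires and the loop marches on
    done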
[18:31:47] you could make the argument that the primary task was to reboot, and it did reboot :)
[18:32:14] but, I'd make the case that if icinga doesn't come back clean (or really anything else goes amiss) that it should be a FAIL + non-zero at least
[18:34:42] agree on the exit status
[18:34:59] as for the wait for user input, it depends on how unattended people run it
[18:35:21] as for not removing the downtime, I think it's better to alert now than at a later time, when the operator might be away
[18:35:32] yeah agreed on the latter point
[18:36:21] now to figure out why that pybal diff check is failing heh
[18:36:23] so either stop (before removing the downtime) and ask for user input (retry the check / skip / abort), or exit with failure
[18:36:33] it's been failing since this morning, on the IP added in https://gerrit.wikimedia.org/r/c/operations/puppet/+/774488/
[18:37:09] another option could be to get the icinga status before the reboot and check that after the reboot it's the same or better :D
[18:37:55] could be related to the fact that it's in state: lvs_setup ?
[18:37:56] heh
[18:38:13] I think it's that both backend hosts are weight=0 + pooled:inactive
[18:38:36] lvs_setup should be an ok state as far as alerts go, I think
[18:39:42] klausman: elukey: is it ok to pool those hosts?
[18:39:57] bblack@cumin1001:~$ confctl select cluster=ml_staging get
[18:39:57] {"ml-staging-ctrl2002.codfw.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=codfw,cluster=ml_staging,service=kubemaster"}
[18:40:00] {"ml-staging-ctrl2001.codfw.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=codfw,cluster=ml_staging,service=kubemaster"}
[19:05:38] I'm gonna Be Bold and pool them. even if they're failing, it'll clear a nuisance CRIT for pybal. either way it's brand-new and it's staging.
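Pooling them would be a confctl invocation roughly like the one below, mirroring the `get` shown above. This is only a sketch: the weight is arbitrary and the set syntax is worth double-checking against the conftool docs before running it.

    sudo confctl select 'cluster=ml_staging,service=kubemaster' set/pooled=yes:weight=10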
[19:29:17] bblack: the immediate fix for sre.hosts.reboot-single to fail on icinga non-optimal is https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/775946
[19:30:08] \o/
[20:41:23] volans: more fun! [but I don't think you need to fix anything]
[20:41:35] we have some icinga checks on the cp nodes which actually depend on services being pooled
[20:41:57] Varnishkafka related?
[20:41:58] and the reboot-single logic won't repool after --depool unless icinga is all-clear, so basically it can't be successful :)
[20:42:44] great :/ that's why it's better to use higher abstractions for larger clusters, where we can fine-tune those things
[20:42:53] yeah
[20:43:10] although honestly, I think the problem here is how our hosts + checks are configured, not the script
[20:43:46] that's also true, the check should be pooled-status-aware
[20:43:57] it is, but it goes UNKNOWN when depooled
[20:44:10] instead of OK
[20:44:15] eh :)
[20:44:16] fun
[20:44:40] check_trafficserver_log_fifo_notpurge_backend
[20:44:49] -> UNKNOWN: service ats-be is not pooled
[20:45:01] if the script returns any other code than 0 or 1 it would become UNKNOWN i think, not just 3
[20:46:18] it does it explicitly in this case
[20:46:37] in /usr/local/lib/nagios/plugins/check_trafficserver_log_fifo
[21:32:36] linking this here again from yesterday:
[21:32:37] https://phabricator.wikimedia.org/T304089#7819217
[21:32:48] esams->drmrs failover test, starts in ~1.5 hours
[21:33:45] not expecting any real issues, especially in the first several hours, but will keep an eye on things
[22:47:08] https://gerrit.wikimedia.org/r/c/operations/dns/+/776004 <- is the esams depool patch that will go out shortly. In case there's an issue and one of us isn't available or responsive, that's the thing to revert.
[22:47:53] I'll be watching the first bit, sukhbir will take over later for some hours, and then eventually mmandere will be watching in his TZ
[23:25:24] after the initial ramp-in, hit a new high of ~6.3Gbps output across all drmrs transits and peers. ~2.6Gbps transport backhaul while still doing some cache fill.
[23:26:00] highest single transit is Telia at ~2.5Gbps
[23:26:58] no huge NEL spikes or anything
[23:27:24] the regional traffic is headed into its overnight low trough, so it should be pretty uneventful for the next several hours.
[23:29:36] it'll climb back up to about the levels we had just now by sometime around ~06:00 UTC -ish, and then keep rising from there. That's when we'll really need to keep a closer eye on transits.
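On the check exit-code discussion from earlier (around 20:44-20:46): Nagios-style plugins map exit codes 0/1/2/3 to OK/WARNING/CRITICAL/UNKNOWN, so a check that wants deliberately depooled hosts to look clean has to exit 0 rather than 3 in that case. A hypothetical sketch of that behaviour (not the real check_trafficserver_log_fifo; `pooled_state_of` is a placeholder for however the plugin actually determines pool status):

    #!/bin/bash
    # hypothetical: report OK instead of UNKNOWN when ats-be is deliberately depooled
    state="$(pooled_state_of ats-be)"   # placeholder helper, not a real command
    if [ "$state" != "yes" ]; then
        echo "OK: service ats-be is not pooled, skipping fifo check"
        exit 0
    fi
    # ... the real log fifo check would follow here ...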