[07:58:50] morning. so I had a question for the puppet parser compiler: I have sent a patch that touches a single file in modules/gerrit ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/706042/ ) and I forgot to add a `Hosts:` header in the message
[07:59:19] if I run the puppet compiler against that without mentioning any host, is it smart enough to detect that the `gerrit` module is only applied on gerrit1001 / gerrit2001 and thus only run against those hosts?
[07:59:33] or would it trigger a run on a bunch of servers?
[08:00:10] I am guessing jbond might know
[08:13:38] hashar: you can simply specify manually on which host you want to run PCC? https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/build
[08:19:47] hashar: it is not that smart :)
[08:23:04] OH I forgot about the manual compiler. Thx moritzm!
[08:25:58] also the edit commit message button on gerrit is pretty handy ;)
[08:43:46] yeah I felt I could just `check experimental` and have the compiler handle the magic for me
[08:43:54] laziness is my primary skill
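(For context: the puppet catalog compiler only runs against hosts it is told about, either through the Jenkins job form linked above or through the `Hosts:` footer in the commit message that hashar forgot. A minimal illustration of such a footer follows; the commit subject, the exact hostnames and the comma-separated FQDN syntax are assumptions for the example, not taken from the patch in question.)

    gerrit: example change touching only the gerrit module

    Hosts: gerrit1001.wikimedia.org,gerrit2001.wikimedia.org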
[10:01:57] ryankemper: fyi there is an alert for wdqs DNS discovery https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=puppetmaster1001&service=DNS+Discovery+operations+diffs
[10:56:58] do you downtime servers affected by the network change as part of the operation? eg, all servers in the row
[10:57:14] otherwise I may downtime some of ours myself
[10:59:02] cc topranks
[11:00:38] arturo: In Icinga? No, given how things went on Tuesday I hadn't been planning on disabling any alerts/checks.
[11:01:16] ok. I may leave them activated and see if there is any service disruption detected
[11:01:21] We don't expect any alerts/checks if things go to plan, and thus the checks are useful to let us know if, in fact, something does go offline for an extended period.
[11:01:39] 👍
[11:01:55] I phrased that terribly but yes - we don't expect alerts. And if something doesn't go to plan they'll be useful to tell us that.
[11:01:56] thanks :)
[11:05:02] <_joe_> vgutierrez: can I restart pybal on lvs2010?
[11:05:31] ^^ effie
[11:05:37] yup, no problem on my side
[11:05:53] he knows I can't say no to him, so he is asking you vgutierrez
[11:05:57] <_joe_> lol
[11:06:11] <_joe_> I'm actually trying to troubleshoot the issues we're seeing with mwdebug
[11:06:21] oh.. those are expected and triggered by effie
[11:06:27] * vgutierrez runs away
[11:06:32] they are!
[11:07:22] from lvs we are getting a timeout when connecting to mwdebug's vip
[11:07:38] so we are trying to figure out what was missed
[11:08:20] <_joe_> effie: so now pybal on 2010 can connect to all the backends
[11:08:28] what was the issue ?
[11:08:38] <_joe_> that you were checking a nonexistent url
[11:08:43] <_joe_> I removed the proxyfetch
[11:08:55] <_joe_> you should query enwiki's blank page like we do elsewhere
[11:08:59] copypasta fail
[11:09:04] <_joe_> or even don't
[11:09:07] err I warned effie about that yesterday :_)
[11:09:09] <_joe_> just keep idleconnection
[11:09:18] <_joe_> now
[11:09:19] I know you did
[11:09:28] <_joe_> on lvs2010 I can see all nodes as pooled
[11:09:28] I just didn't know where to start looking
[11:09:45] <_joe_> ipvsadm says all servers are pooled correctly
[11:09:59] <_joe_> so I'd do that change anyways
[11:10:42] should I include it in the monitoring_setup patch ?
[11:10:44] <_joe_> effie: sorry one doubt
[11:10:53] <_joe_> where were you doing the curl from?
[11:11:14] lvs2010
[11:11:30] but I didn't specify an interface
[11:12:01] although, hmm
[11:12:08] <_joe_> deploy1002:~ $ curl -k -I -H 'Host: en.wikipedia.org' -H 'X-Forwarded-Proto: https' https://10.2.1.59:4444/wiki/Main_Page
[11:12:10] <_joe_> HTTP/1.1 200 OK
[11:12:25] ok cool
[11:12:35] <_joe_> the issue is doing the connection from the lvs host
[11:12:38] <_joe_> it can't work there
[11:12:49] <_joe_> you are connecting to an IP that is local
[11:13:36] ah ok, I understand now
[11:15:03] <_joe_> anyways, yes, remove the proxyfetch
[11:15:19] <_joe_> it's not useful, the healthchecks are inside kubernetes
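(A note on the two check types being discussed: pybal's IdleConnection monitor is an L4 check - it only needs to be able to hold a TCP connection to the backend - while ProxyFetch is an L7 check that fetches a URL and expects an HTTP 200, much like the curl above. The sketch below illustrates that difference in standalone Python; it is not pybal's actual code, and the VIP, port, path and headers are just copied from the curl example, so treat them as placeholders.)

    # Illustrative sketch only (not pybal code): the difference between an
    # L4 "IdleConnection"-style check and an L7 "ProxyFetch"-style check.
    # VIP/port/URL are placeholders taken from the curl example above.
    import http.client
    import socket
    import ssl

    VIP, PORT = "10.2.1.59", 4444

    def idle_connection_check(host, port, timeout=5.0):
        """L4: success simply means a TCP connection could be established."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def proxyfetch_check(host, port, timeout=5.0):
        """L7: fetch a known-good page and require an HTTP 200."""
        ctx = ssl.create_default_context()
        ctx.check_hostname = False          # equivalent of curl -k,
        ctx.verify_mode = ssl.CERT_NONE     # acceptable for a sketch, not for real code
        conn = http.client.HTTPSConnection(host, port, timeout=timeout, context=ctx)
        try:
            conn.request("HEAD", "/wiki/Main_Page",
                         headers={"Host": "en.wikipedia.org",
                                  "X-Forwarded-Proto": "https"})
            return conn.getresponse().status == 200
        except (OSError, http.client.HTTPException):
            return False
        finally:
            conn.close()

    if __name__ == "__main__":
        print("L4 (IdleConnection-style):", idle_connection_check(VIP, PORT))
        print("L7 (ProxyFetch-style):", proxyfetch_check(VIP, PORT))

(This also explains the failed test from lvs2010 itself: as _joe_ points out, the service IP is configured locally on the load balancer, so a connection to the VIP made from that host never reaches the backends.)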
[12:49:03] I'm having some trouble trying to separate the alerts on alertmanager that I should be paying attention to from the ones I should not (given that there's a bunch with instance alert1001, so the filters become tricky). Is there a way to add a tag, say "owner" or "team" (or anything) to certain alerts so they can be filtered that way?
[12:55:21] dcaro: I'm assuming you mean the Icinga-prefixed alerts in alertmanager in this case, tl;dr "not at the moment" but it is something on the radar, see also https://phabricator.wikimedia.org/T284213#7212555
[12:55:49] dcaro: the icinga -> am exporter/client though is custom and python, if you can/want to take a stab at the problem
[12:56:10] custom as in "built inhouse"
[13:02:54] godog: I'd be happy to, is there anything I should know about it? (from 1 to 10, how much risk do you think it has to change it? and how many days do you think it could take with 90% certainty? xd)
[13:05:11] https://gerrit.wikimedia.org/r/admin/repos/operations/debs/prometheus-icinga-exporter ?
[13:06:18] dcaro: yeah that's the repo, risk level I think it is reasonably low as there are tests
[13:06:19] that looks like a stats exporter
[13:06:34] it does both, stats exporter and am client
[13:06:36] am.py :)
[13:07:26] probably a couple of days of coding and then deploy
[13:07:36] ack
[13:08:34] implementation wise I think what would make sense is to read a configuration file to match alerts and tags to attach if there's a match
[13:08:47] not sure if alert name alone is enough, could be
[13:09:26] can a tag have multiple values?
[13:10:10] within an alert no, also I called tags what prometheus calls labels but you get the idea
[13:11:03] okok, that makes it a bit tricky as multiple config entries might match the same alerts
[13:11:47] I might just leave a comment and leave that as a todo when it becomes a problem though
[13:12:49] *nod* yeah that works too, thanks!
[13:13:15] gotta run an errand, bbiab
[13:13:34] \o, would you mind if I black/isort everything before the changes?
[13:13:59] dcaro: that'd be most welcome!
[13:14:12] 👍
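(To illustrate the idea being discussed - reading a configuration file that matches alerts and attaches extra labels before they are sent to Alertmanager - here is a rough sketch. None of the file paths, rule formats or team names below come from prometheus-icinga-exporter; they are invented for the example, and it takes the simple "first match wins" approach rather than solving the overlapping-matches question raised above.)

    # Hypothetical sketch of a config-driven "attach a team label to matching
    # alerts" step; not actual prometheus-icinga-exporter code.
    import re
    import yaml  # assumes PyYAML is available

    # Example rules file (path and format are invented for this sketch):
    #
    #   - match: 'wdqs|blazegraph'
    #     labels:
    #       team: search-platform
    #   - match: '^MariaDB'
    #     labels:
    #       team: data-persistence

    def load_rules(path):
        """Compile the match patterns once at startup."""
        with open(path) as f:
            return [(re.compile(rule["match"]), rule["labels"])
                    for rule in yaml.safe_load(f)]

    def extra_labels(alert_name, rules):
        """Return the labels of the first matching rule; first match wins,
        which sidesteps (for now) the overlapping-matches caveat above."""
        for pattern, labels in rules:
            if pattern.search(alert_name):
                return dict(labels)
        return {}

    # Usage, roughly: when building the label set for an alert that is about
    # to be pushed to Alertmanager, merge in the extras:
    #     labels.update(extra_labels(alert_name, rules))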
[13:36:23] _joe_: checking https://gerrit.wikimedia.org/r/c/operations/puppet/+/706355, dropping ProxyFetch looks weird to me, we would be paging based on a L7 test but pybal itself would be only performing L4 monitoring
[13:43:46] volans: how does one remove ipv6 dns entries for a host in netbox?
[13:44:01] kormat: I can help with that
[13:44:20] kormat: was it added by mistake?
[13:44:28] volans: yes
[13:44:47] basically https://wikitech.wikimedia.org/wiki/DNS/Netbox#How_can_I_add_the_IPv6_AAAA/PTR_records_to_a_host_that_doesn't_have_it?
[13:44:52] https://phabricator.wikimedia.org/T282484
[13:44:54] just emptying the field
[13:45:52] thanks
[13:46:31] feel free to add/expand the docs
[13:46:37] for the reverse operation :)
[13:48:48] rzl: FYI Olja can def approve access requests, but she might need a special ping on slack or something :)
[13:49:34] volans: are there any docs on the server status field in netbox? (active/staging/planned etc, and what they mean to us)
[13:49:54] yes
[13:50:08] https://wikitech.wikimedia.org/wiki/Server_Lifecycle#States
[13:50:21] and all the transitions described below
[13:51:50] mm. the semantics are more implied than stated. but ok, thanks
[13:52:34] (e.g. why do we have a 'staged' state?)
[13:53:22] handover from dcops to service owner usually
[13:54:34] <_joe_> vgutierrez: no, pybal is completely useless for k8s-based services
[13:54:42] <_joe_> actually the stupider it acts, the better
[13:54:59] _joe_: so let's be coherent and page on L4 status as well at LVS level
[13:55:21] <_joe_> vgutierrez: not sure I understand
[13:55:48] if pybal only performs IdleConnection checks.. the associated check_command should be a check_tcp one instead of a check_https...
[13:55:59] <_joe_> we do page for l4 status
[13:56:19] <_joe_> vgutierrez: I disagree, that's not testing pybal, it's testing the loadbalanced service
[13:56:34] via pybal
[13:56:36] <_joe_> only difference is that kubernetes pools/depools pods based on its ReadinessProbe
[13:56:46] yeah, I get that
[13:57:04] <_joe_> yeah ok it's just a simple layer, we don't want to alert on each single server
[13:57:35] <_joe_> once we are able to advertise bgp for our services from the hosts, we might just cut pybal off
[14:00:19] volans: any idea how long for the dns update to stabilize? i'm getting inconsistent results. e.g. one query for 'pc1011' from cumin1001 gives just a v4 result, and the next query gives v4+v6
[14:02:40] kormat: if the query was performed before, it's cached by the resolvers, you can wipe the record if needed
[14:02:51] https://wikitech.wikimedia.org/wiki/DNS#How_to_Remove_a_record_from_the_DNS_resolver_caches
[14:03:23] the authdns are consistent at the end of the cookbook, the resolvers have their own caches and will respect the TTL
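(The inconsistency kormat sees is exactly that recursor caching: the authoritative servers already agree, but each recursor keeps serving its cached answer until the TTL runs out or the record is wiped as described in the wikitech page linked above. A quick way to see what a given recursor is still handing out is sketched below using the dnspython library; the hostname assumes the eqiad.wmnet domain and the recursor address is a placeholder, not a production resolver.)

    # Sketch: ask a specific recursor for the AAAA record and show the remaining
    # cache TTL. Uses dnspython; NAME's domain and the recursor IP are assumptions.
    import dns.resolver

    NAME = "pc1011.eqiad.wmnet"   # assuming the eqiad.wmnet domain
    RECURSOR = "10.3.0.1"         # placeholder recursor address

    res = dns.resolver.Resolver(configure=False)
    res.nameservers = [RECURSOR]

    try:
        answer = res.resolve(NAME, "AAAA")
        print([rr.to_text() for rr in answer], "remaining TTL:", answer.rrset.ttl)
    except dns.resolver.NoAnswer:
        print("no AAAA record (the stale entry is gone from this recursor)")
    except dns.resolver.NXDOMAIN:
        print("name does not exist")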
[14:03:31] _joe_: wouldn't it make sense though to start alerting when we have unavailable pods due to k8s checks failing ?
[14:03:43] rather than having pybal alerting us about it?
[14:03:48] volans: perfect, thanks!
[14:03:52] generally speaking I mean
[14:04:19] <_joe_> effie: that's not how it works, and we don't have "pybal alerting us about it"
[14:04:33] <_joe_> we have icinga checking the service via the loadbalancer
[14:04:56] <_joe_> if you have more sensible ways of checking the service from an outside perspective, I'm happy to oblige
[14:05:08] <_joe_> but then you need to convince alex too ofc
[14:05:36] he is cheap, I think a chocolate bar would do the trick
[14:06:01] <_joe_> oh come on, you need to do better than some candy :D
[14:37:49] ottomata: yeah, if you weren't about to be back anyway, I would have tried that next :D but thank you, appreciated!
[14:50:29] All - about to do the change on eqiad row C switches now in 10 mins
[14:51:02] now or in 10 minutes? pick one :-P
[14:51:37] you know the way us Irish people talk makes no sense. It's too late for me to stop now :)
[14:51:47] volans: topranks is demonstrating the networking concept of split-brain
[14:51:58] I thought that was the distributed-systems concept of clock skew
[14:52:17] or something about special relativity if we're in real trouble
[14:53:05] jbond: can you do the honours again re: disabling puppet?
[14:56:38] topranks: I can do it now
[14:56:53] moritzm: great, thanks :)
[14:57:39] running, should be disabled in maybe a minute
[14:58:34] puppet disabled fleet-wide
[14:59:44] and just FYI topranks the disable-puppet script waits for any in-flight puppet run to complete (with a timeout), so when it exits with 0 it means no running puppet and puppet disabled
[15:00:02] cool, thanks for the info
[15:01:45] 🍿 time
[15:01:59] yeah, it failed to disable on two out of 1722 hosts, that's typically broken hardware
[15:03:17] Ok I think we are good to go
[15:04:17] Executing...
[15:05:46] Done.
[15:06:11] done, done for real?
[15:06:16] Seems good - "rapid" mode ping from CR (kind of like fping), lost 2 (one on egress, one on ingress change).
[15:06:21] 100000 packets transmitted, 99998 packets received, 0% packet loss
[15:06:32] done, done :)
[15:06:42] nice, good job!
[15:06:44] nicely done topranks
[15:06:48] nice!
[15:06:51] ok, I detected no issues on the couple of systems I was monitoring, nice work!
[15:07:12] nice :D
[15:07:14] great - I am still checking everything seems healthy - but looks to be :)
[15:07:20] you're spoiling us
[15:08:15] next time XioNoX leaves us without network for 1 minute we are all going to be out with pitchforks and torches
[15:08:19] Seems to be a very quick change alright, and applied in sequence to all devices in the virtual-chassis.
[15:09:10] I'm re-enabling Puppet
[15:09:41] vgutierrez: looks like Juniper's hardware is trying hard, 1 switch failure in codfw and 1 linecard failure in codfw
[15:10:36] at least we're not getting any router committing seppuku due to license bugs
[15:12:45] Hardware just dying is one thing... a license bug crashing it would make me very mad.
[15:13:03] well... :)
[15:13:38] https://phabricator.wikimedia.org/T193897
[15:14:25] 2018 already??
[15:14:28] * vgutierrez getting old
[15:14:57] do we have a tasks hall of fame?
[15:14:59] or shame :)
[15:15:50] That's pretty new I would say XD
[15:16:17] sounds like that was fun XioNoX :D
[15:16:47] I think everything is good with asw2-c-eqiad, LibreNMS stats on subsequent polls look good also.
[15:16:48] "fun" :)
[15:17:19] Thanks all for your vigilance, this is a sensitive change so it is appreciated, even with things going so well thus far.
[15:20:43] Krinkle: FYI: T284825 is complete. we now need to wait 3 weeks for the new PC hosts to be populated, and then we can make one or more of them primary and see what happens.
[15:20:44] T284825: Productionize pc2011-pc2014 and pc1011-pc1014 - https://phabricator.wikimedia.org/T284825
[15:53:47] <_joe_> topranks: it's just a play by junos to fool you into feeling confident
[15:53:55] <_joe_> the mess will strike with the last maintenance
[15:54:11] I'm laughing but deep inside I know you're right :D
[15:56:05] I would bet on the one before the last
[15:56:16] the last would be predictable
[15:56:40] https://en.wikipedia.org/wiki/Unexpected_hanging_paradox
[21:40:09] legoktm: fyi, I've made some improvements to this dash. specifically to render a graph with the APCu usage of a single server when selected (previously the apcu row was always an overview). https://grafana-rw.wikimedia.org/d/GuHySj3mz/mediawiki-php-service
[21:40:35] I vaguely recall there being another dash that shows this already, but I couldn't find it.