[08:22:28] <_joe_> there is an increase in requests to proton, causing the outage
[08:23:49] https://grafana.wikimedia.org/goto/OfpcHixIz?orgId=1
[08:25:11] <_joe_> zhwiki
[08:26:07] <_joe_> logs for proton
[08:27:05] <_joe_> it's a scraper
[08:27:12] <_joe_> I am inclined to ban them
[08:28:11] from https://logstash.wikimedia.org/app/dashboards#/view/e1bb3340-f997-11e8-b3c1-4ff0065d7257?_g=h@e78830b&_a=h@c01765a I see some from zh and some (legitimate?) from en
[08:29:08] more from zh than en
[10:21:34] arnoldokoth: OK for me to merge your blackbox check as well?
[10:23:03] does someone know how I can copy an existing alert from the alertmanager UI and come up with its CLI equivalent, or if not, copy it somehow to be used later?
[10:23:29] context is that we have an alert in place (c118b532-a297-42d5-ade3-9f0c50065aa0) and this will be useful later as well, so I wanted to save it/clone it/etc.
[10:23:37] *silenced alert
[10:24:18] arnaudb: Yes.
[10:24:31] done!
[10:24:32] sukhe: what do you mean?
[10:25:20] volans: I want to copy an alert and use it later with some changed parameters (site, address), or find out how I can convert an existing alert into its CLI equivalent
[10:29:52] sukhe: dunno via the CLI, but curl "http://alertmanager-eqiad.wikimedia.org/api/v2/silence/c118b532-a297-42d5-ade3-9f0c50065aa0"
[10:29:55] works
[10:31:22] you could also manage all this via spicerack's alertmanager support ;)
[10:35:52] volans: curl for the above ID works. I am guessing spicerack's alertmanager support will allow me to POST them?
[10:37:02] volans: I will RTM https://doc.wikimedia.org/spicerack/v2.2.0/api/spicerack.alertmanager.html?highlight=alertmanager :)
[10:37:58] you can create custom matchers, yes: https://doc.wikimedia.org/spicerack/master/api/spicerack.alertmanager.html#spicerack.alertmanager.Alertmanager.downtime
[10:38:13] that or the context manager downtimed()
[10:38:41] cool thanks!
[11:05:26] hello folks!
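The clone-a-silence workflow discussed above (fetch the existing silence with GET, tweak it, POST the copy back) can be sketched with curl against the Alertmanager v2 API. This is a sketch, not a verified runbook: the matcher values, author, duration, and comment below are illustrative placeholders, and the actual curl calls are left commented out because they need access to the alertmanager host.

```shell
# Sketch: clone an Alertmanager silence with changed parameters.
# Step 1 (commented out; needs cluster access) — save the original,
# as shown in the log:
#   curl -s "http://alertmanager-eqiad.wikimedia.org/api/v2/silence/c118b532-a297-42d5-ade3-9f0c50065aa0" > original.json
# Step 2 — a new silence is created by POSTing a payload of this shape
# to /api/v2/silences (values below are made up for illustration):
start=$(date -u +%Y-%m-%dT%H:%M:%SZ)
end=$(date -u -d '+4 hours' +%Y-%m-%dT%H:%M:%SZ)   # GNU date syntax
payload=$(cat <<EOF
{
  "matchers": [{"name": "site", "value": "esams", "isRegex": false}],
  "startsAt": "$start",
  "endsAt": "$end",
  "createdBy": "someone@example",
  "comment": "cloned silence with changed site"
}
EOF
)
echo "$payload"
# Step 3 (commented out; needs cluster access):
#   curl -s -X POST -H 'Content-Type: application/json' \
#     -d "$payload" "http://alertmanager-eqiad.wikimedia.org/api/v2/silences"
```

Note that POSTing to `/api/v2/silences` creates a new silence with a new ID; the original is untouched, which matches the "save it/clone it" intent in the log.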
[11:05:30] !incidents
[11:05:30] 4550 (ACKED) [8x] ProbeDown sre (probes/service esams)
[11:05:30] 4549 (RESOLVED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway eqiad)
[11:05:52] If the on-callers are OK, I'd proceed with the maintenance on the Docker Registry nodes
[11:06:47] more info in https://phabricator.wikimedia.org/T360637
[11:07:04] cc: fabfur, _joe_
[11:07:32] <_joe_> elukey: please do if there is no deployment ongoing
[11:07:44] we have esams depooled but this shouldn't affect this at all
[11:07:59] so for me it's a go
[11:09:27] _joe_ I don't see anything ongoing so far
[11:27:53] registry2003 done, going to wait a bit and then I'll repool and proceed with 2004
[11:41:33] 2004 done as well, going to repool in a bit
[11:45:07] all done
[11:45:33] going to monitor for a bit, ping me in case something odd happens
[12:10:30] is there any standard way to prevent apt/dpkg from starting a daemon? We have the use case for Jenkins; the deb package uses start-stop-daemon, if that matters
[12:13:17] yes, you can mask it, there is stuff in puppet to do that
[12:13:49] that is what I thought, but the package cannot be updated when the service is masked
[12:13:53] I gotta investigate that part :)
[12:51:36] !incidents
[12:51:36] 4550 (RESOLVED) [8x] ProbeDown sre (probes/service esams)
[12:51:37] 4549 (RESOLVED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway eqiad)
[13:50:12] re: the esams incidents, I don't know what silences were issued, though we can/should silence site=esams and that is supposed to take care of all/most things cc sukhe
[13:51:00] godog: I wanted to silence site=esams and a specific set of IP addresses for the probes. but yeah, I have a better idea now I guess :)
[13:51:26] I was hesitant about a blanket site=esams silence, is what I am trying to say
[13:52:14] ah I see, but yeah site=esams is supposed to do the right thing; if it doesn't, we'll fix it
[14:03:41] well except we might want to see the unexpected ones (e.g. if phy maint on those 8x hosts accidentally broke a fiber for one of our esams transits, or knocked another related host offline, etc)
[14:04:22] yeah in this case, we left the upload cluster untouched for that reason; it wasn't downtimed in case it was affected
[14:06:28] the stateless-ness of the UX on AM is what bugs me in these scenarios
[14:07:48] stevemunene: coming to the wmcs/de sync by chance? The calendar says you rsvp'd yes
[14:08:20] but maybe that's just me being less familiar with it. but it always seems a bit chancy even figuring out what to pre-silence there, without an easy hierarchical view like icinga/nagios.
[14:09:44] yes, thanks andrewbogott
[14:09:51] (and IIUC, you can only see active+suppressed in the main view, no way to see things that are in a nominal state?)
[14:18:28] bblack: that's right yeah, the configured prometheus alerting rules are available from prometheus itself, e.g. https://prometheus-esams.wikimedia.org/ops/alerts
[14:20:46] and yes indeed re: unexpected alerts, then it makes sense not to silence the site as a whole
[14:30:23] oh that's a nice view, thanks!
[14:32:10] sure np
[14:32:44] yeah thanks, this is definitely TIL for me as well and this also shows the actual rules, which is nice
[14:52:44] sukhe fabfur and the rest of the traffic folks, so far so good
[14:53:16] effie: congrats and it will be good!
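On the apt/dpkg question earlier in the log ([12:10:30]): besides masking the unit, the standard Debian hook for this is a `policy-rc.d` script that exits 101, which makes `invoke-rc.d` refuse to start or restart services while packages are being installed or upgraded, without blocking the upgrade itself the way masking does. One caveat that matters for the Jenkins case: this only intercepts maintainer scripts that go through `invoke-rc.d`; a postinst that calls `start-stop-daemon` directly bypasses it. A minimal sketch (written to /tmp here for illustration; the real file belongs at /usr/sbin/policy-rc.d):

```shell
# Sketch: block service (re)starts from maintainer scripts during an upgrade.
# Exit code 101 means "action forbidden by policy" to invoke-rc.d.
cat > /tmp/policy-rc.d <<'EOF'
#!/bin/sh
exit 101
EOF
chmod +x /tmp/policy-rc.d
# For a real upgrade you would install this as /usr/sbin/policy-rc.d,
# run apt, then remove it again. Demonstrate the refusal code:
rc=0
/tmp/policy-rc.d || rc=$?
echo "policy-rc.d exit code: $rc"
```

Because the package's init scripts still run to completion (they just get "forbidden" answers from invoke-rc.d), the package upgrades cleanly, unlike the masked-service case mentioned in the log.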
[14:53:16] 👍
[14:53:50] sukhe: :) luckily that was the easiest part
[14:53:56] (famous, last, words)
[14:55:36] https://grafana.wikimedia.org/goto/PKTw3ixIz?orgId=1 picking up
[14:56:27] !incidents
[14:56:29] 4551 (ACKED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway codfw)
[14:56:29] 4550 (RESOLVED) [8x] ProbeDown sre (probes/service esams)
[14:56:29] 4549 (RESOLVED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway eqiad)
[16:00:49] brett: fabfur: denisse: I'm going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1015051 as supposedly the scraping has been stopped for a while; hopefully it does not cause any issues or pages, but I'll be watching
[16:05:58] cdanis: ACK, thank you.
[16:39:20] effie: do you know if the acmechief_host was affected by the switchdc?
[16:40:24] or should I ask traffic directly?
[16:41:15] all the overrides in puppet's hieradata point to acmechief2002, but the value in common points to acmechief1001, which was changed in November
[16:49:49] okay, the above is fully deployed; traffic levels look reasonable for both egress to Facebook and on the eqiad<>codfw transport
[16:56:25] volans: I think 2002 is hardcoded, but nothing on my radar about it related to the switchover
[16:56:28] I think acmechief1001 is the default in puppet for puppet5 and acmechief2002 is the puppet7 one, but it would be nice to have confirmation
[16:56:53] wikitech doesn't help :)
[17:22:14] Afaik the DC switch doesn't impact acme-chief at all
[17:39:08] ack, so 1001 for puppet5 and 2002 for puppet7 is correct?
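The https://prometheus-esams.wikimedia.org/ops/alerts view mentioned earlier in the log also has a machine-readable equivalent: Prometheus exposes its configured rules over the HTTP API at `/api/v1/rules`, and the `type=alert` query parameter filters out recording rules. A small sketch that only builds the URLs (no network access assumed here; the curl pipeline in the comment is an illustrative way to pull out rule names, not a vetted command):

```shell
# Sketch: JSON equivalent of the Prometheus alerts UI page.
# "ops" is the instance path used on these hosts, per the URL in the log.
base="https://prometheus-esams.wikimedia.org/ops"
ui="${base}/alerts"
api="${base}/api/v1/rules?type=alert"
echo "UI:  $ui"
echo "API: $api"
# With network access you could then list configured alert names, e.g.:
#   curl -s "$api" | grep -o '"name":"[^"]*"' | sort -u
```

This is handy for exactly the pre-silencing problem discussed above: it shows rules in their nominal state, not just the active+suppressed ones visible in the Alertmanager UI.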
[18:33:51] brett, denisse: beware https://phabricator.wikimedia.org/T361133
[18:34:30] the x1 replica on codfw is close to breakage; I have mitigated it to the best of my ability
[18:35:16] mysql at db2196 will need a restart + SET GLOBAL read_only = 0; if it crashes
[18:35:34] please pass it on to the other people on call this week; call the DBAs if in doubt
[18:36:12] I don't want to touch it further and turn a potential outage into a real one while it holds
[18:41:45] volans: yeah acme-chief should not be affected by the switchover and yes, 1001 is the active host and 2002 for puppet 7
[18:42:17] Thx sukhe
[18:53:20] thanks for confirming sukhe. doh, papaul is not here
[18:53:56] notified
[19:01:51] feel free to point him to us if it helps :)
[19:43:39] !incidents
[19:43:39] 4551 (RESOLVED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway codfw)
[19:43:39] 4550 (RESOLVED) [8x] ProbeDown sre (probes/service esams)
[19:43:39] 4549 (RESOLVED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway eqiad)
[20:16:18] !incidents
[20:16:18] 4552 (ACKED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway codfw)
[20:16:19] 4551 (RESOLVED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway codfw)
[20:16:19] 4550 (RESOLVED) [8x] ProbeDown sre (probes/service esams)
[20:16:19] 4549 (RESOLVED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway eqiad)
[21:42:19] j.ynus: ack
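The db2196 handoff above amounts to a two-step recovery if the crash happens: restart the MySQL service, then re-enable writes. A dry-run sketch of those steps, under assumptions: the service unit name `mariadb` and the verification query are mine, not from the log, and as the handoff itself says, anyone actually running this should confirm with the DBAs first.

```shell
# Sketch: recovery steps for db2196 per the handoff above, dry-run by default.
# Assumption: the MySQL service unit on this host is called "mariadb".
DRY_RUN=1
run() {
  if [ "$DRY_RUN" -eq 1 ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}
run sudo systemctl restart mariadb
run sudo mysql -e 'SET GLOBAL read_only = 0;'
# Optional check afterwards (hypothetical verification step):
run sudo mysql -e 'SELECT @@global.read_only;'
```

With `DRY_RUN=1` the commands are only printed, which makes the handoff copy-pasteable into a pastebin or task comment without any risk of touching the host.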