[08:22:28] <_joe_> there is an increase in requests to proton, causing the outage
[08:23:49] https://grafana.wikimedia.org/goto/OfpcHixIz?orgId=1
[08:25:11] <_joe_> zhwiki
[08:26:07] <_joe_> logs for proton
[08:27:05] <_joe_> it's a scraper
[08:27:12] <_joe_> I am inclined to ban them
[08:28:11] from https://logstash.wikimedia.org/app/dashboards#/view/e1bb3340-f997-11e8-b3c1-4ff0065d7257?_g=h@e78830b&_a=h@c01765a I see some from zh and some (legitimate?) from en
[08:29:08] more from zh than en
[10:21:34] arnoldokoth: OK for me to merge your blackbox check as well?
[10:23:03] does someone know how I can copy an existing alert from the alertmanager UI and come up with its CLI equivalent, or if not, copy it somehow to be used later?
[10:23:29] context is that we have an alert in place (c118b532-a297-42d5-ade3-9f0c50065aa0) and this will be useful later as well, so I wanted to save it/clone it/etc.
[10:23:37] *silenced alert
[10:24:18] arnaudb: Yes.
[10:24:31] done!
[10:24:32] sukhe: what do you mean?
[10:25:20] volans: I want to copy an alert and use it later with some changed parameters (site, address), or find out how I can convert an existing alert into its CLI equivalent
[10:29:52] sukhe: dunno via the CLI, but curl "http://alertmanager-eqiad.wikimedia.org/api/v2/silence/c118b532-a297-42d5-ade3-9f0c50065aa0"
[10:29:55] works
[10:31:22] you could also manage all this via spicerack's alertmanager support ;)
[10:35:52] volans: curl for the above ID works. I am guessing spicerack's alertmanager support will allow me to POST them?
[10:37:02] volans: I will RTM https://doc.wikimedia.org/spicerack/v2.2.0/api/spicerack.alertmanager.html?highlight=alertmanager :)
[10:37:58] you can create custom matchers, yes: https://doc.wikimedia.org/spicerack/master/api/spicerack.alertmanager.html#spicerack.alertmanager.Alertmanager.downtime
[10:38:13] that or the context manager downtimed()
[10:38:41] cool thanks!
[11:05:26] hello folks!
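The clone-a-silence workflow discussed above (fetch the existing silence with GET, tweak it, POST the copy back) can be sketched with curl against the Alertmanager v2 API. This is a sketch, not a verified runbook: the matcher values, author, duration, and comment below are illustrative placeholders, and the actual curl calls are left commented out because they need access to the alertmanager host.

```shell
# Sketch: clone an Alertmanager silence with changed parameters.
# Step 1 (commented out; needs cluster access) — save the original,
# as shown in the log:
#   curl -s "http://alertmanager-eqiad.wikimedia.org/api/v2/silence/c118b532-a297-42d5-ade3-9f0c50065aa0" > original.json
# Step 2 — a new silence is created by POSTing a payload of this shape
# to /api/v2/silences (values below are made up for illustration):
start=$(date -u +%Y-%m-%dT%H:%M:%SZ)
end=$(date -u -d '+4 hours' +%Y-%m-%dT%H:%M:%SZ)   # GNU date syntax
payload=$(cat <<EOF
{
  "matchers": [{"name": "site", "value": "esams", "isRegex": false}],
  "startsAt": "$start",
  "endsAt": "$end",
  "createdBy": "someone@example",
  "comment": "cloned silence with changed site"
}
EOF
)
echo "$payload"
# Step 3 (commented out; needs cluster access):
#   curl -s -X POST -H 'Content-Type: application/json' \
#     -d "$payload" "http://alertmanager-eqiad.wikimedia.org/api/v2/silences"
```

Note that POSTing to `/api/v2/silences` creates a new silence with a new ID; the original is untouched, which matches the "save it/clone it" intent in the log.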
[11:05:30] !incidents
[11:05:30] 4550 (ACKED) [8x] ProbeDown sre (probes/service esams)
[11:05:30] 4549 (RESOLVED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway eqiad)
[11:05:52] If the on-callers are OK, I'd proceed with the maintenance on the Docker Registry nodes
[11:06:47] more info in https://phabricator.wikimedia.org/T360637
[11:07:04] cc: fabfur, _joe_
[11:07:32] <_joe_> elukey: please do if there is no deployment ongoing
[11:07:44] we have esams depooled but this shouldn't affect this at all
[11:07:59] so for me it's a go
[11:09:27] _joe_ I don't see anything ongoing so far
[11:27:53] registry2003 done, going to wait a bit and then I'll repool and proceed with 2004
[11:41:33] 2004 done as well, going to repool in a bit
[11:45:07] all done
[11:45:33] going to monitor for a bit, ping me in case something odd happens
[12:10:30] is there any standard way to prevent apt/dpkg from starting a daemon? We have the use case for Jenkins; the deb package uses start-stop-daemon, if that matters
[12:13:17] yes, you can mask it, there is stuff in puppet to do that
[12:13:49] that is what I thought, but the package cannot be updated when the service is masked
[12:13:53] I gotta investigate that part :)
[12:51:36] !incidents
[12:51:36] 4550 (RESOLVED) [8x] ProbeDown sre (probes/service esams)
[12:51:37] 4549 (RESOLVED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway eqiad)
[13:50:12] re: the esams incidents, I don't know what silences were issued, though we can/should silence site=esams and that is supposed to take care of all/most things cc sukhe
[13:51:00] godog: I wanted to silence site=esams and a specific set of IP addresses for the probes. but yeah, I have a better idea now I guess :)
[13:51:26] I was hesitant about a blanket site=esams silence, is what I am trying to say
[13:52:14] ah I see, but yeah site=esams is supposed to do the right thing; if it doesn't, we'll fix it
[14:03:41] well except we might want to see the unexpected ones (e.g. if phy maint on those 8x hosts accidentally broke a fiber for one of our esams transits, or knocked another related host offline, etc)
[14:04:22] yeah in this case, we left the upload cluster untouched for that reason; it wasn't downtimed in case it was affected
[14:06:28] the stateless-ness of the UX on AM is what bugs me in these scenarios
[14:07:48] stevemunene: coming to the wmcs/de sync by chance? The calendar says you rsvp'd yes
[14:08:20] but maybe that's just me being less familiar with it. but it always seems a bit chancy even figuring out what to pre-silence there, without an easy hierarchical view like icinga/nagios.
[14:09:44] yes, thanks andrewbogott
[14:09:51] (and IIUC, you can only see active+suppressed in the main view, no way to see things that are in a nominal state?)
[14:18:28] bblack: that's right yeah, the configured prometheus alerting rules are available from prometheus itself, e.g. https://prometheus-esams.wikimedia.org/ops/alerts
[14:20:46] and yes indeed re: unexpected alerts, then it makes sense not to silence the site as a whole
[14:30:23] oh that's a nice view, thanks!
[14:32:10] sure np
[14:32:44] yeah thanks, this is definitely TIL for me as well and this also shows the actual rules, which is nice
[14:52:44] sukhe fabfur and the rest of the traffic folks, so far so good
[14:53:16] effie: congrats and it will be good!
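On the apt/dpkg question earlier in the log ([12:10:30]): besides masking the unit, the standard Debian hook for this is a `policy-rc.d` script that exits 101, which makes `invoke-rc.d` refuse to start or restart services while packages are being installed or upgraded, without blocking the upgrade itself the way masking does. One caveat that matters for the Jenkins case: this only intercepts maintainer scripts that go through `invoke-rc.d`; a postinst that calls `start-stop-daemon` directly bypasses it. A minimal sketch (written to /tmp here for illustration; the real file belongs at /usr/sbin/policy-rc.d):

```shell
# Sketch: block service (re)starts from maintainer scripts during an upgrade.
# Exit code 101 means "action forbidden by policy" to invoke-rc.d.
cat > /tmp/policy-rc.d <<'EOF'
#!/bin/sh
exit 101
EOF
chmod +x /tmp/policy-rc.d
# For a real upgrade you would install this as /usr/sbin/policy-rc.d,
# run apt, then remove it again. Demonstrate the refusal code:
rc=0
/tmp/policy-rc.d || rc=$?
echo "policy-rc.d exit code: $rc"
```

Because the package's init scripts still run to completion (they just get "forbidden" answers from invoke-rc.d), the package upgrades cleanly, unlike the masked-service case mentioned in the log.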
[14:53:16] 👍
[14:53:50] sukhe: :) luckily that was the easiest part
[14:53:56] (famous, last, words)
[14:55:36] https://grafana.wikimedia.org/goto/PKTw3ixIz?orgId=1 picking up
[14:56:27] !incidents
[14:56:29] 4551 (ACKED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway codfw)
[14:56:29] 4550 (RESOLVED) [8x] ProbeDown sre (probes/service esams)
[14:56:29] 4549 (RESOLVED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway eqiad)
[16:00:49] brett: fabfur: denisse: I'm going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1015051 as supposedly the scraping has been stopped for a while; hopefully it does not cause any issues or pages, but I'll be watching
[16:05:58] cdanis: ACK, thank you.
[16:39:20] effie: do you know if the acmechief_host was affected by the switchdc?
[16:40:24] or should I ask traffic directly?
[16:41:15] all the overrides in puppet's hieradata point to acmechief2002, but the value in common points to acmechief1001, which was changed in November
[16:49:49] okay, the above is fully deployed; traffic levels look reasonable for both egress to Facebook and on the eqiad<>codfw transport
[16:56:25] volans: I think 2002 is hardcoded, but nothing on my radar about it related to the switchover
[16:56:28] I think acmechief1001 is the default in puppet for puppet5 and acmechief2002 is the puppet7 one, but it would be nice to have confirmation
[16:56:53] wikitech doesn't help :)
[17:22:14] Afaik the DC switch doesn't impact acme-chief at all
[17:39:08] ack, so 1001 for puppet5 and 2002 for puppet7 is correct?
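The https://prometheus-esams.wikimedia.org/ops/alerts view mentioned earlier in the log also has a machine-readable equivalent: Prometheus exposes its configured rules over the HTTP API at `/api/v1/rules`, and the `type=alert` query parameter filters out recording rules. A small sketch that only builds the URLs (no network access assumed here; the curl pipeline in the comment is an illustrative way to pull out rule names, not a vetted command):

```shell
# Sketch: JSON equivalent of the Prometheus alerts UI page.
# "ops" is the instance path used on these hosts, per the URL in the log.
base="https://prometheus-esams.wikimedia.org/ops"
ui="${base}/alerts"
api="${base}/api/v1/rules?type=alert"
echo "UI:  $ui"
echo "API: $api"
# With network access you could then list configured alert names, e.g.:
#   curl -s "$api" | grep -o '"name":"[^"]*"' | sort -u
```

This is handy for exactly the pre-silencing problem discussed above: it shows rules in their nominal state, not just the active+suppressed ones visible in the Alertmanager UI.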
[18:33:51] brett, denisse: beware https://phabricator.wikimedia.org/T361133
[18:34:30] the x1 replica on codfw is close to breakage; I have mitigated it to the best of my ability
[18:35:16] mysql at db2196 will need a restart + SET GLOBAL read_only = 0; if it crashes
[18:35:34] please pass it on to the other people on call this week; call the DBAs if in doubt
[18:36:12] I don't want to touch it further and turn a potential outage into a real one while it holds
[18:41:45] volans: yeah acme-chief should not be affected by the switchover and yes, 1001 is the active host and 2002 for puppet 7
[18:42:17] Thx sukhe
[18:53:20] thanks for confirming sukhe. doh, papaul is not here
[18:53:56] notified
[19:01:51] feel free to point him to us if it helps :)
[19:43:39] !incidents
[19:43:39] 4551 (RESOLVED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway codfw)
[19:43:39] 4550 (RESOLVED) [8x] ProbeDown sre (probes/service esams)
[19:43:39] 4549 (RESOLVED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway eqiad)
[20:16:18] !incidents
[20:16:18] 4552 (ACKED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway codfw)
[20:16:19] 4551 (RESOLVED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway codfw)
[20:16:19] 4550 (RESOLVED) [8x] ProbeDown sre (probes/service esams)
[20:16:19] 4549 (RESOLVED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway eqiad)
[21:42:19] j.ynus: ack
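The db2196 handoff above amounts to a two-step recovery if the crash happens: restart the MySQL service, then re-enable writes. A dry-run sketch of those steps, under assumptions: the service unit name `mariadb` and the verification query are mine, not from the log, and as the handoff itself says, anyone actually running this should confirm with the DBAs first.

```shell
# Sketch: recovery steps for db2196 per the handoff above, dry-run by default.
# Assumption: the MySQL service unit on this host is called "mariadb".
DRY_RUN=1
run() {
  if [ "$DRY_RUN" -eq 1 ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}
run sudo systemctl restart mariadb
run sudo mysql -e 'SET GLOBAL read_only = 0;'
# Optional check afterwards (hypothetical verification step):
run sudo mysql -e 'SELECT @@global.read_only;'
```

With `DRY_RUN=1` the commands are only printed, which makes the handoff copy-pasteable into a pastebin or task comment without any risk of touching the host.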