[08:47:00] <moritzm>	 FYI, in 15 mins the IDPs will be moved to new servers
[08:48:32] <godog>	 ack
[16:26:37] <robh>	 !log ganeti5003 firmware updates in progress via T308238
[16:26:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:41] <stashbot>	 T308238: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238
[16:31:10] <robh>	 !log ganeti5003 reboot accidental by rob, fixing
[16:31:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:18] <robh>	 its 3003, sigh.
[16:31:54] <robh>	 moritzm: So yeah, i messed up and rebooted ganeti5003 just now without downtiming or draining it
[16:32:00] <robh>	 cuz im dumb and meant to work on ganeti3003
[16:32:11] <robh>	 and used the idrac so it didnt force me to retype the hostname
[16:32:28] <robh>	 basically i bypassed all the checks to prevent this by doing it via ilom manually... sorry about that
[16:32:34] <XioNoX>	 robh: no big deal, let me check what's there
[16:32:40] <robh>	 its rebooting, i killed the firmware updates cuz they are already done on that host
[16:32:58] <robh>	 too many open tabs, i know this mistake.
[16:33:04] <robh>	 work on one thing at a time.
[16:33:46] <robh>	 its still booting
[16:34:11] <robh>	 ok, that is my mistake for this week, not allowed anymore ; D
[16:34:12] <cdanis>	 or not enough automation :)
[16:34:31] <robh>	 yeah, once we automate bios updates this kinda thing wont happen
[16:34:39] <robh>	 but i caused a false trail on that thinking it could tftp
[16:34:43] <XioNoX>	 easier to look at icinga to see what was there
[16:34:44] <robh>	 but NOPE, only the idrac can tftp update
[16:34:54] <robh>	 rest of firmware requires good old ftp
[16:35:21] <XioNoX>	 looks like doh5002 (cc sukhe) and netflow5002
[16:35:25] <cdanis>	 and prometheus5001
[16:35:48] <XioNoX>	 yep (cc herron)
[16:36:01] <XioNoX>	 I don't think anything is critical here
[16:36:23] <robh>	 when you say check icinga you mean just check for all down items cuz i dont see a list of whats on ganeti5003?
[16:36:28] <XioNoX>	 2 are monitoring, the last one should failover as it uses bird
[16:36:42] <XioNoX>	 robh: yeah exactly
[16:36:47] <cdanis>	 robh: all down items in eqsin, if you look at https://icinga.wikimedia.org/alerts they are right there
[16:36:56] <cdanis>	 I just ctrl-f'd for '500'
[16:37:22] <robh>	 heh, ok cool
[16:37:28] <robh>	 so yeah... sorry about that =P
[16:37:48] <robh>	 I was all riding high on figuring out wtf happened with relined
[16:37:51] <robh>	 reality didnt like that
[16:37:53] <XioNoX>	 robh: might be worth opening a task to figure out how to improve automation there
[16:38:08] <XioNoX>	 detailing the steps needed, etc
[16:38:14] <robh>	 XioNoX: So i think the overall thing is manual firmware updating is bad and eventually we need to automate, and yeah
[16:38:22] <robh>	 i think there is a task already, i'll add in this event as reason to move that along
[16:38:34] <XioNoX>	 nice, I think John was working on it?
[16:39:36] <slyngs>	 Ah if you're going to break Ganeti let me know, so I can test my Ganeti Prometheus client :-)
[16:40:27] <robh>	 yeah https://phabricator.wikimedia.org/T283771
[16:40:32] <robh>	 i'll append my bork to that today
[16:40:33] <XioNoX>	 slyngs: I'm curious, what will that do/export?
[16:41:44] <slyngs>	 Metrics for capacity management, CPU/RAM availability, number of offline node, distribution of primary and secondary instances
[16:41:47] <robh>	 so those down items are gone now
[16:41:55] <robh>	 are they all doin their services as expected again?
[16:42:05] <XioNoX>	 slyngs: nice!
[16:42:26] <robh>	 dunno how the hell i put in ganeti5003.eqsin.wmnet when i had the tab open for ganeti3003.esams.wmnet... those arent at all the same irght?  ; D
[16:42:42] <robh>	 5 characters different even!
[16:43:02] <sukhe>	 sorry just catching up
[16:43:06] <slyngs>	 I was going to say a large number of the characters are the same
[16:43:35] <robh>	 yeah, when you reboot a host in the OS, it clearly makes you type the hostname
[16:43:47] <robh>	 and via script it outputs the entire hostname and makes you confirm that what you are about to do will break it
[16:44:04] * cdanis forcing icinga re-check on all the purple UNKNOWNs for 500x hosts
[16:44:04] <robh>	 but via manual idrac buttons... relies on me not being inattentive.
[16:44:29] <robh>	 !log ganeti5003 returned to service after accidental reboot
[16:44:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:36] <XioNoX>	 sukhe: is doh5002 back to normal?
[16:44:48] <robh>	 !log ganeti3003 (already depooled) coming down for firmware update and reimage via T308238
[16:44:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:51] <stashbot>	 T308238: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238
[16:45:04] <cdanis>	 this isn't really a big deal tbh -- this is an event that could literally happen on its own at any time
[16:45:30] <slyngs>	 Best to view it as a failover test :-)
[16:45:36] <XioNoX>	 yeah, if a server going down is a big deal, then something should be designed better
[16:45:46] <XioNoX>	 unless it's a DB master :)
[16:45:52] <cdanis>	 eheh
[16:45:56] <cdanis>	 all the purples cleared
[16:46:45] <cdanis>	 https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=doh5002 looks good too
[16:47:02] <cdanis>	 oh!
[16:47:05] <cdanis>	 one last teachable moment here
[16:47:34] <cdanis>	 the next time that anyone, robh or otherwise, has an "oh crap I did something Bad in production" moment, when you post you should also cc the current oncalls
[16:47:44] <robh>	 oh, true, didnt even think of that
[16:48:02] <robh>	 should we keep oncall names in topic maybe or just rely on wikitech?
[16:48:18] <cdanis>	 the topic isn't a bad idea
[16:48:29] <robh>	 we have ops clinic listed in -operations
[16:48:47] <robh>	 so not first time we do such a thing heh
[16:48:48] <cdanis>	 and speaking of being oncall, looking at shellbox now
[16:48:49] <rzl>	 I think the only reason it's not in the topic is it changes 3x a day, and we haven't yet figured out how to give everybody ops easily
[16:48:59] <rzl>	 in -operations I mean
[16:49:02] <sukhe>	 XioNoX: seems to be OK, checking running DNS tests
[16:49:07] <robh>	 hrmm, maybe just link to the wikitech page in topci
[16:49:09] <sukhe>	 all good
[16:49:27] <cdanis>	 rzl: IRC integration 😇
[16:49:28] <rzl>	 we could also just name both NA and both EU oncalls, and then it only changes weekly
[16:49:39] <rzl>	 then you're on your own for figuring out what time it is, but that's not so bad
[16:51:03] <cdanis>	 rzl: do you know anything about shellbox?  who in ustz does?
[16:51:10] <cdanis>	 https://grafana.wikimedia.org/d/RKogW1m7z/shellbox?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-service=shellbox&var-namespace=shellbox&var-release=main&viewPanel=28
[16:51:20] <rzl>	 k.unal :( let me see what I can figure out though
[16:51:20] <cdanis>	 I am going to guess it is either underprovisioned or overcapacity, given that 'throttled' line
[16:51:44] <herron>	 hmm yeah that'd make sense
[16:51:54] <cdanis>	 I am also fine with simply throwing more replicas at the problem and asking intelligent questions later
[16:51:58] <rzl>	 that smells right, looks like it's regularly spiky but this last spike is much higher
[16:52:05] <rzl>	 and yeah was about to suggest the same
[16:52:18] <XioNoX>	 already going down too
[16:52:38] <rzl>	 Yes, But For How Long™
[16:53:11] <XioNoX>	 of course, was saying that to frame the urgency
[16:53:30] <rzl>	 yeah :)
[16:53:36] <cdanis>	 yeah
[16:53:46] <brett>	 godog: Looking into fixing the daily mails of pontoon being quite out of date on puppet - went to apply and I'm getting "Could not request certificate: The certificate retrieved from the master does not match the agent's private key." - Do you have any advice/caution?
[16:53:47] <cdanis>	 current traffic saturated the cpu reservation and 100% of the php-fpm workers
[16:53:53] <cdanis>	 so ofc the blackbox prober paged
[16:53:57] <brett>	 (tagging you because it says so in https://wikitech.wikimedia.org/wiki/Puppet/Pontoon)
[16:54:34] <rzl>	 yeah, this looks like it was just regular old spiky traffic -- I'll be interested in where it came from, whether it's legit or should be blocked, whether we should expect it long-term, etc
[16:54:43] <rzl>	 but in the meantime if we can just provision for it, let's do that
[16:54:45] <cdanis>	 looks like we have only 5 replicas?
[16:55:46] <cdanis>	 no, sorry, 8
[16:56:26] <cdanis>	 I suggest just doubling that?
[16:56:35] <cdanis>	 https://gerrit.wikimedia.org/g/operations/deployment-charts/+/c12d4dea22aca8178fda4242fce04ee95362cb3b/helmfile.d/services/shellbox/values.yaml#12
[16:56:44] <herron>	 SGTM
[16:58:10] <rzl>	 sounds right, just double-checking to make sure it's the right shellbox
[16:58:55] <cdanis>	 it is port 4008 robh 
[16:58:58] <cdanis>	 s/robh/rzl/
[16:59:06] <rzl>	 yep that's the one
[16:59:32] <rzl>	 and doubling to 16 sounds good, are you doing that or am I?
[16:59:44] <cdanis>	 if it is easy for you i'd appreciate it
[16:59:58] <rzl>	 I think it is, and if it's not that'll be a valuable experience :) doing
[17:00:22] <herron>	 ha I was about to volunteer for the same reasons
[17:00:38] <cdanis>	 I'm about to hop into a 1:1
[17:00:46] <rzl>	 herron: you can get the next one :D
[17:00:48] <rzl>	 cdanis: ack
[17:00:56] <herron>	 cheers cdanis thanks
[17:01:10] <herron>	 ha fair enough rzl
[17:02:12] <rzl>	 mailed https://gerrit.wikimedia.org/r/803953, any stamps?
[17:02:49] <rzl>	 thanks!
[17:02:56] <herron>	 np :)
[17:03:04] <jayme>	 slownp :)
[17:03:28] * rzl await(jenkins)
[17:04:09] <jayme>	 It looks like this was kind of a request spike maybe?
[17:04:19] <jayme>	 from https://grafana.wikimedia.org/d/3SiE86Nnz/mediawiki-shellouts?orgId=1 
[17:04:20] <rzl>	 yeah agreed
[17:04:44] <jayme>	 there is really a huge spike f lilypond requests
[17:04:55] <rzl>	 lilypond spike is consistent with this being shellbox (rather than shellbox-something) since that's the score instance
[17:05:10] <jayme>	 yup
[17:05:26] <jayme>	 the pdfhandler stuff is shellbox-media afaik
[17:12:09] <rzl>	 mm, there were already diffs with production :/ so this "helmfile apply" is going to double the replicas but also upgrade from shellbox 1.0.0 to 1.0.3
[17:12:39] <rzl>	 probably nbd but not ideal -- I'm going to start with staging first, even though I'm not changing the replicas there
[17:13:18] <XioNoX>	 surprise upgrade!
[17:15:48] <rzl>	 I think jayme and I chatted at one point about how it'd be nice to have a process for keeping these diffs clean, or at least warning us if they're dirty for too long
[17:17:15] <jayme>	 looks like that did not directly lead somewhere :)
[17:17:35] <rzl>	 shellbox staging is updated and hasn't burned to the ground, although I don't offhand know a way to test it
[17:20:07] <jayme>	 the wikitech page has nothing more then the healthz check - that has been done by kubernetes for you already
[17:21:09] <rzl>	 yeah, ideally I'd like to give it some test traffic, later I'll see about how to do that and try to document it
[17:21:14] <rzl>	 for now I'm going ahead with codfw
[17:22:44] <rzl>	 we get a steady trickle of requests there, I'm watching https://grafana.wikimedia.org/d/RKogW1m7z/shellbox?orgId=1&var-dc=codfw+prometheus%2Fk8s&var-service=shellbox&var-namespace=shellbox&var-release=main&from=1654705347482&to=1654708947482
[17:24:33] <rzl>	 hm, ProbeDown again, looks like it stopped answering briefly during the rollout
[17:25:11] <cdanis>	 it looks like more saturation tbh
[17:25:33] <cdanis>	 rps picked back up since 17:15 and apache workers busy picked way up at 17:21
[17:26:05] <jayme>	 the downprobe was from eqiad aiui
[17:26:05] <rzl>	 oh man yeah, I didn't even see the traffic in eqiad because I was watching the rollout in codfw
[17:28:37] <rzl>	 codfw looks recovered -- I was surprised to see the request rate is doubled, but of course it is, it's just health checks and there are twice as many
[17:29:32] <rzl>	 going ahead in eqiad now
[17:30:19] <rzl>	 heh, Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress
[17:30:24] <rzl>	 I lost the race with mwdebug it looks like
[17:33:06] <jayme>	 nah, that does not make sense
[17:33:43] <jayme>	 those operations are on a "per release" basis. So more like per chart
[17:34:01] <jayme>	 "per release" with release being the "helm release"
[17:34:23] <rzl>	 hm okay, so there's some phantom operation with shellbox eqiad then
[17:34:31] <jayme>	 yeah
[17:34:34] <jayme>	 let me take a look
[17:35:04] <rzl>	 thanks!
[17:35:42] <jayme>	 uh...
[17:35:58] <cdanis>	 👀
[17:36:16] <jayme>	 try "kube_env admin eqiad"
[17:36:21] <jayme>	 (sudo -i)
[17:36:35] <jayme>	 helm3 -n shellbox list --all
[17:36:47] <jayme>	 without --all it looks frightening :)
[17:37:12] <rzl>	 2022-01-12, eh
[17:37:34] <jayme>	 yeah...no idea
[17:37:43] <cdanis>	 is that when someone started an upgrade that was never finished?
[17:37:48] <jayme>	 "helm3 -n shellbox history main" gives you the history for the main release
[17:37:59] <jayme>	 cdanis: yes. ^C maybe
[17:38:01] <rzl>	 cdanis: evidently yeah, the status is listed as pending-upgrade
[17:38:01] <cdanis>	 I think we've found at least two things we should have alerts for 😅
[17:38:14] <jayme>	 agreed
[17:38:40] <jayme>	 so, way out of this mess is roll back to revision 2 of release main, then re-run the helmfile deployment
[17:39:03] <rzl>	 ah, cool
[17:39:04] <jayme>	 luckily revision 2 is the same chart version than the one currently "pending upgrade"
[17:39:16] <rzl>	 can I roll back via helmfile, or is that still done through helm3 directly?
[17:40:20] <jayme>	 that needs to be done with helmfile
[17:40:24] <jayme>	 äh
[17:40:29] <jayme>	 helm3, sorry
[17:40:46] <rzl>	 okay cool - I see https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency but it's still instructions for helm rather than helm3
[17:41:02] <rzl>	 is it just "helm3 rollback" though?
[17:41:29] <jayme>	 "helm3 -n shellbox main 2"
[17:41:52] <rzl>	 amazing thank you
[17:41:53] <jayme>	 there is a (non-ideal) doc at https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency
[17:42:39] <rzl>	 and I guess I don't have to say eqiad anywhere because that's set through kube_env
[17:43:06] <jayme>	 yes, correct. If you're in doubt run "kubectl get nodes"
[17:43:12] <rzl>	 👍
[17:44:15] <rzl>	 > Rollback was a success! Happy Helming!
[17:44:21] <jayme>	 nice
[17:44:27] <rzl>	 and list --all now shows revision 4, status deployed
[17:44:40] <rzl>	 backing out and rerunning the helmfile apply, then
[17:44:54] <jayme>	 👍
[17:45:05] <rzl>	 ahhh there we go
[17:45:08] <rzl>	 thanks for the handholding!
[17:45:18] <jayme>	 sure thing, yw!
[17:45:56] <cdanis>	 so summarizing followup AIs here
[17:46:13] <cdanis>	 * have an alert for a helm service that has diffs between what's deployed and what is head-of-tree in git
[17:46:31] <cdanis>	 * have an alert for a helm deployment/upgrade/other mutation that has been open for "too long" (days? a week?)
[17:46:53] <jayme>	 #1 is https://phabricator.wikimedia.org/T265979
[17:47:54] <cdanis>	 I guess another question in my head is -- shellbox is not active/active, correct?
[17:48:43] <jayme>	 it is
[17:48:48] <jayme>	 or at least it should be
[17:48:50] <cdanis>	 ah okay cool
[17:49:11] <cdanis>	 so if had to do something more invasive with the eqiad deployment, we could have depooled it there
[17:49:20] <jayme>	 yes
[17:49:50] <cdanis>	 cool, thanks!
[17:50:00] <jayme>	 np
[17:50:01] <cdanis>	 I'm going to take a little break and then I'll file #2
[17:51:17] <rzl>	 janis: check me? https://wikitech.wikimedia.org/w/index.php?title=Kubernetes/Deployments&diff=1987861&oldid=1987818
[17:51:28] <rzl>	 uh, jayme: I mean, the one you actually highlight on
[17:51:55] <jayme>	 I hightlight on both ;)
[17:52:00] <rzl>	 haha oh good
[17:52:05] <jayme>	 looks good, thanks
[17:53:04] <jayme>	 and thanks for filing a task cdanis - I'll drop off. Have a nice day o/
[17:53:13] <rzl>	 good night! thanks again for the help
[18:39:55] <legoktm>	  13:17:35 <rzl> shellbox staging is updated and hasn't burned to the ground, although I don't offhand know a way to test it <-- live hack $wgShellboxUrls on mwdebug to point to staging, then modify one of the Score pages on testwiki (you need to change the content in <score> tags to bypass cache) and see that it renders properly
[18:41:15] <legoktm>	 I still don't have a good sense of how much Shellbox should be resistant to load spikes vs having spare replicas running that won't get used 99% of the time
[18:41:49] <rzl>	 ahh okay cool! thanks for that
[18:42:16] <rzl>	 and yeah, I know we've discussed that question before
[20:04:30] <arnoldokoth>	 Hey cdanis: herron: we are currently performing a rolling restart of memcache in eqiad with an hour between hosts.
[20:06:42] <cdanis>	 ok! sounds good
[20:08:24] <herron>	 rgr that
[20:27:00] <Krinkle>	 > [#wikimedia-releng] <•wikibugs> Deployments, serviceops, Wikimedia-production-error: mw1415 fatals due to serving responses from 1.39.0-wmf.10 (was DBQueryError: Unknown column page_restrictions)
[20:27:29] <Krinkle>	 > <Krinkle> dancy: well, at least it's not serving prod I think? Last sal entry says its depooled.
[20:28:04] <Krinkle>	 I see an empty line at https://config-master.wikimedia.org/pybal/eqiad/appservers-https where it would presumably be
[20:28:07] <Krinkle>	 not even pooled=false
[20:30:25] <mutante>	 Krinkle: when there is an empty line that means it is pooled=inactive
[20:30:29] <mutante>	 like more off than false
[20:30:44] <mutante>	 usually that is done for things like hardware repair
[20:31:21] <Krinkle>	 mutante: ok, does pooled=inactive also means it doesn't receive scap deploy but remains serving apache traffic to pyball and health checks?
[20:31:30] <Krinkle>	 it's problematic to have 5 week old code running zombie in production
[20:31:45] <mutante>	 no, it should mean "not even in config" as in "no traffic AND no scap deploys"
[20:31:55] <mutante>	 that's why there is another alert about mw versions not matching
[20:32:00] <mutante>	 because it didnt get the deploy
[20:32:11] <mutante>	 this is what used to be "not in dsh groups"
[20:32:49] <Krinkle>	 https://alerts.wikimedia.org/?q=%40state%3Dactive&q=instance%3Dmw1415
[20:32:59] <Krinkle>	 > Host mw1415 is not in mediawiki-installation dsh group
[20:33:05] <Krinkle>	 > WARNING: Missing 1 sites from wikiversions. 982 mismatched wikiversions
[20:33:43] <mutante>	 yea, that happens when it's not in scap (dsh) groups
[20:33:46] <Krinkle>	 so.. assuming the alert has not been ignored for a long time, what does that mean exactly. the host came back to live
[20:33:47] <mutante>	 no deployments to that host
[20:34:02] <Krinkle>	 and alert started firing just now?
[20:34:42] <mutante>	 I can't say much about alertmanager, unlike icinga
[20:34:55] <mutante>	 but I would think it can be an expired downtime
[20:35:15] <mutante>	 or someone/something deployed to "only canary hosts" specifically
[20:35:23] <mutante>	 outside the "dsh groups"
[20:35:33] <mutante>	 which translates to confctl settings 
[20:36:13] <mutante>	 a 'scap pull' on the host should fix the "mismatching mw versions" alert separate from getting traffic or not
[20:36:48] <mutante>	 checking SAL for entries about that host
[20:37:21] <Krinkle>	 I don't thnk the MW version warning is something we should ack. like I said, we can't be running 5-week old code in production. If this is normal, I thikn we need to make it so that this state results in apache being turned off or healthcecks being skipped or something. This intermediary state of the host clearly being up and serving apache pings whilst not getting code updates seems like a state that should be impossible.
[20:37:53] <Krinkle>	 but then again, I think that's mostly true already, just something went wrong here
[20:38:31] <mutante>	 a host that is not in confctl should not get any traffic
[20:38:46] <mutante>	 but it does not mean monitoring is tied to it
[20:39:11] <mutante>	 alertmanager/icinga probably has no idea whether a host is in dsh groups (getting traffic). it just keeps checking
[20:39:37] <Krinkle>	 I assume `/w/api.php?action=query&meta=siteinfo&format=json&formatversion=2` requests are coming from pyball, right?
[20:39:46] <Krinkle>	 these are part of the healthchecks?
[20:39:47] <mutante>	 I don't know why this host was removed. Usually that would happen just when hardware breaks
[20:40:05] <Krinkle>	 - /wiki/Main_Page  as well
[20:40:18] <Krinkle>	 those two basically are taking turns every 5 second triggering a fatal error
[20:40:30] <mutante>	 this _seems_ to indicate it just happened today https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=mw1415&service=mediawiki-installation+DSH+group
[20:40:36] <mutante>	 but there is nothing in SAL
[20:41:15] <mutante>	 modules/nagios_common/files/check_commands/check_etcd_mw_config_lastindex.py:    url = 'http://{host}/w/api.php?action=query&meta=siteinfo&format=json&formatversion=2'.format(
[20:41:38] <mutante>	 seems to be this 'check_etcd_mw_config" thing
[20:42:37] <cdanis>	 Krinkle: yes, I believe so
[20:42:57] <Krinkle>	 so I've run `scap pull` by hand for now to silence the logspam
[20:43:06] <mutante>	 thing is.. I don't know why it was set to inactive
[20:43:11] <cdanis>	 Krinkle: did you find this in apache logs?  there should be a user-agent
[20:43:23] <mutante>	 I was about to ask if you want me to just scap pull
[20:43:29] <mutante>	 then we can set it to pooled=no again
[20:43:36] <mutante>	 which will assure it will get scap deploys
[20:43:48] <mutante>	 and clear more alerts
[20:43:49] <Krinkle>	 but I think it's worth following up here so that we can figure out a strategy that does not involve it being part of the normal runbook to leave a host such that it is alive but not receiving code updates and actively serving old MW code connecting to production memc/mysql in response to health checks.
[20:44:07] <Krinkle>	 the fatal errors are actually a good thing, could've been worse if it e.g. silentely corrupted stuff in memc or mysql. 
[20:45:14] <cdanis>	 do we have a defined backwards compatibility window for that?
[20:45:28] <mutante>	 It doesn't make sense that the icinga alert history said it never alerted about "not in dsh groups" until today.. while we also say it was actively serving weeks old code.
[20:45:43] <Krinkle>	 it sounds like pooled=inactive consistently creates this result, which presumalby has a valid use case for being a third state, but we'll need to handle that better in some sense.
[20:45:51] <mutante>	 either it was in that state or not
[20:46:32] <mutante>	 I don't think it's actively serving old code.. except that monitoring checks won't stop checking
[20:46:49] <cdanis>	 https://phabricator.wikimedia.org/T307755
[20:46:54] <cdanis>	 it was broken?
[20:46:56] <Krinkle>	 cdanis: for trains, it's 1 week, but there are also regular depoyments that backport changes to all branches and then flip a config flag with the expectation that within ~5min things propagate. the same for e.g. schema changes and maintenane jobs, after a few minutes we expect no new processes to start with the old code.
[20:46:58] <mutante>	 more concering to me is _why_ it was set to inactive without SAL
[20:47:01] <cdanis>	 and looks like it got fixed today
[20:47:32] <cdanis>	 prometheus data backs this up too
[20:47:36] <cdanis>	 https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=mw1415&var-datasource=thanos&var-cluster=appserver&from=now-90d&to=now
[20:47:44] <cdanis>	 it looks like it has been dead since the 6th of May
[20:47:46] <mutante>	 pooled=inactive has always meant "no scap AND no traffic"
[20:47:54] <mutante>	 ooh.. of course.. that is the host that we called Dell for.. duh
[20:48:00] <cdanis>	 you filed the ticket :)
[20:48:01] <mutante>	 the Dell tech was there
[20:48:21] <Krinkle>	 ack, if a host is unresponsive or broken, it makes sense to take it out of dsh as otherwise scap will prompt deployers with something to respond to, separation of concerns etc.
[20:48:35] <mutante>	 yea, there are many tickets. doesn't mean I knew the tech would fix that today
[20:48:55] <cdanis>	 I don't think we can realistically do what you're saying Krinkle
[20:48:56] <mutante>	 yea, Krinkle. that is exactly why. because otherwise deployers will get errors every time
[20:49:12] <cdanis>	 the hardware of a host breaks suddenly, no warning
[20:49:40] <mutante>	 I think we would need a workflow where dcops changes pool state.
[20:49:40] <cdanis>	 what do we do once the hardware is fixed -- or even to test that the hardware is fixed?  do we firewall off its IP from the memcacheds and the mysql servers?
[20:49:49] <mutante>	 or has to schedule everything with us
[20:50:26] <mutante>	 this is kind of the opposite of unexpected failure. unexpected fix 
[20:50:43] <dancy>	 Haha yes
[20:50:52] <dancy>	 Damn fixes
[20:51:01] <Krinkle>	 cdanis: ack, reminds me a bit of what we do with mysql, where afaik we after reboot start with service off and/or read-only.
[20:51:19] <Krinkle>	 assuming a reboot happened as part of this unexpected fix, perhaps that's reasonable to start with, until an SRE can repool it. 
[20:51:33] <Krinkle>	 the runbooks already say that we need to run scap pull by hand before repooling it 
[20:51:46] <cdanis>	 we could do something like that, sure, but in general I think serviceops has been of the belief that if a host isn't receiving production traffic, it isn't going to perform any mutations against datastores
[20:52:12] <cdanis>	 I guess Main_Page can cause parsercache writes and memcached writes?
[20:52:23] <Krinkle>	 indeed
[20:52:41] <Krinkle>	 and as part of filling in caches we sometimes do db writes e.g. link table migration and actor migration result in lazy populating db rows when things are absent
[20:52:48] <Krinkle>	 which can go on for months while we migrate schemas
[20:53:46] <Krinkle>	 the good news is that the healthcheck URLs are naturally so common that anything they do will have been done already presumably 
[20:53:55] <cdanis>	 naturally
[20:54:19] <Krinkle>	 so perhaps it's good enough that we 1) get the alert soon enough about outdated MW version and lack of dsh, and 2) then respond to that by doing at least a scap pull and pooled=no
[20:54:26] <cdanis>	 so, I don't think that we should have mediawiki not start up upon reboot
[20:54:34] <Krinkle>	 in this case that did not happen within ~2h though
[20:55:14] <cdanis>	 here's a hot take: I don't think this issue is worth worrying about, and all of these warts we have around scap and code versions and how Mediawiki runs will go away once it is on k8s
[20:55:34] <Krinkle>	 :)
[20:55:55] <cdanis>	 this is a small one of many steps that would have to happen to cause some sort of disaster corruption in mysql or memcached
[20:56:27] <cdanis>	 (and, if that did happen, we had better be able to handle it anyway!)
[20:56:51] <Krinkle>	 zooming out: the immediate issue was logspam and confusion for developers running deploymentes and monitoring their components in prod. the response was to file a prod error, and ~1h of investigation to figure out that it is specific to a host that came back to life
[20:57:40] <cdanis>	 so for me it was very obvious what had likely happened as soon as I looked at https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=mw1415&var-datasource=thanos&var-cluster=appserver&from=now-24h&to=now
[20:57:44] <Krinkle>	 would the normal procedure had been that e.g. within the next hour an SRE would have seen the alert and run scap pull and set pooled=no as part of a documented runbook? That might be a good solution in the interim.
[21:01:12] <cdanis>	 what was the alert that fired?
[21:01:51] <Krinkle>	 four alerts are firing: https://alerts.wikimedia.org/?q=%40state%3Dactive&q=instance%3Dmw1415 Etcd status, MW version, PHP HTTP 500, Apache HTTP 500. 
[21:02:17] <Krinkle>	 as of 1 hour ago, since ~30min since the host came online it seems
[21:02:24] <cdanis>	 wikiversions?
[21:02:51] <Krinkle>	 yeah that one too I think
[21:03:05] <Krinkle>	 I don't see it now though
[21:03:14] <Krinkle>	 they should be recovering soon since I ran scap pull
[21:03:30] <cdanis>	 https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=mw1415&service=Ensure+local+MW+versions+match+expected+deployment
[21:03:32] <Krinkle>	 I have not changed any pooling though.
[21:03:33] <cdanis>	 so it fired in an odd way
[21:03:46] <cdanis>	 ahh okay
[21:03:49] <cdanis>	 "Missing 1 sites from wikiversions. 982 mismatched wikiversions"
[21:04:11] <cdanis>	 so it had been so long we added a new wiki, and, we had all of the rest mismatched
[21:04:41] <cdanis>	 that will be noticed eventually, but
[21:04:48] <cdanis>	 * only during business hours, with no guarantees
[21:05:06] <cdanis>	 * it's a very noisy alert, and last I saw often fires a lot around deployment time anyway, so it is often ignored
[21:05:42] <mutante>	 rescheduling the alerts in icinga, hold on
[21:05:53] <mutante>	 one should clear just because you ran scap pull
[21:07:30] <Krinkle>	 "often fires a lot around deployment time" - that's a problem :)
[21:07:37] <Krinkle>	 I see it also lacks a runbook page
[21:07:39] <cdanis>	 yes
[21:07:44] <Krinkle>	 alert points to https://wikitech.wikimedia.org/wiki/Application_servers
[21:07:54] <cdanis>	 also, when it fires, it generally fires on many, many hosts at once and floods the channel
[21:07:56] <cdanis>	 :)
[21:08:04] <Krinkle>	 but "Host mw1415 is not in mediawiki-installation dsh group" is a better one I think
[21:08:17] <Krinkle>	 mw versions is fairly weak as it doesn't consider intra-week changes or config changes etc.
[21:08:21] <Krinkle>	 dsh group will capture everything
[21:08:37] <Krinkle>	 how, if at all, would that one have been responded to?
[21:09:17] <cdanis>	 "serviceops will likely look at it eventually" is the best answer I have :)
[21:09:26] <mutante>	 it would have been ACKed until the hardware repair ticket gets updated
[21:09:38] <cdanis>	 there's also that, yes
[21:13:36] <mutante>	 !log mw1415 - scap pull, restart apache, /usr/local/sbin/restart-php7.2-fpm (INFO: The server is depooled from all services. Restarting the service directly)
[21:13:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:14:04] <cdanis>	 I need to go afk but happy to continue this conversation later
[21:16:00] <mutante>	 Apache alert recovered
[21:16:02] <Krinkle>	 I've boldly updated https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups for now. The last part of that is probably wrong as I'm not sure what the procedure is around hardware repairs etc. maybe that is redundant. Edit at all :)
[21:16:03] <mutante>	 now CRITICAL - degraded: The following units failed: prometheus-phpfpm-statustext-textfile.service
[21:16:16] <mutante>	 trying to clear that up as well
[21:16:39] <Krinkle>	 Edit at will*
[21:17:22] <mutante>	 setting to pooled=no after apache/php is green
[21:17:37] <mutante>	 waiting for "dsh groups" alert to recover after that
[21:17:53] <mutante>	 21:14 <+icinga-wm> RECOVERY - PHP7 rendering on mw1415 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:17:57] <mutante>	 21:14 <+icinga-wm> RECOVERY - Apache HTTP on mw1415 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:20:18] <mutante>	 Krinkle: it's back on https://config-master.wikimedia.org/pybal/eqiad/appservers-https
[21:20:22] <mutante>	 no more missing line
[21:20:52] <Krinkle>	 ack, thanks!
[21:21:04] <mutante>	 https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=mw1415&scroll=178  is all GREEN again
[21:21:26] <mutante>	 that means to me we can now pooled=yes
[21:21:30] <mutante>	 just like before it broke
[21:22:08] <mutante>	 and call the hardware repair ticket resolved. separate from workflow optimization questions
[21:22:27] <Krinkle>	 My feeling is that the way I connected the dots here from happening to see the Phab ticket to jumping in -sre is the part that was off-script and something others likely would not have done. I don't know how long it would have otherwise blocked or confused train/releng/scap. 
[21:23:46] <mutante>	 what is the first thing you noticed?
[21:23:58] <mutante>	 how did it start for you
[21:24:27] <mutante>	 21:20 <+icinga-wm> RECOVERY - mediawiki-installation DSH group on mw1415 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[21:29:51] <Krinkle>	 mutante: https://phabricator.wikimedia.org/T310225 is how it started for me, or rather the spike in Logstash via dancy 
[21:30:07] <Krinkle>	 fatal db error about missing db column
[21:37:13] <mutante>	 ok.. so ... one might argue it's a monitoring notification issue though.. I _did_ see the monitoring alert and that made me ping. and that in return caused that ticket
[21:37:25] <mutante>	 doesn't mean I would always watch IRC though of course
[21:38:43] <mutante>	 so imho it comes down to.. either notification methods of alerting ..or workflow change where dcops can set a status
[21:38:54] <mutante>	 but even that wouldn't help alone because there would be no "correct" status for it
[21:39:46] <mutante>	 let me actually pool that now
[21:51:02] <mutante>	 https://phabricator.wikimedia.org/T310225#7990630
[21:51:15] <mutante>	 closed the hardware repair ticket, left summary on the other one
[21:51:35] <mutante>	 server pooled like before it broke