[08:58:15] I'll just temporarily disable Puppet on Bullseye hosts while testing a new apt repo
[09:05:05] Puppet is re-enabled. Minus two hosts, which seem to be mad at me
[09:09:35] I would like to hand out bonus points for Cumin, it is amazingly fast
[09:43:46] slyngs: you would love it even more if you had been around in the times of salt
[09:44:35] salt (or our setup of it) came with a feature of "heuristic execution" - sometimes it ran, sometimes it didn't (silently)
[09:45:13] I previously used Ansible, and as great as that is for a small number of hosts, it's not great once you pass 10 or so
[09:46:05] yeah, cumin is great because it was built by v*lans with our needs in mind
[09:46:53] and he searched for alternatives first, to avoid a self-maintained tool, but there was nothing that really covered those needs
[09:47:08] (integration with puppet, speed, reliability, etc.)
[09:47:47] Other than speed, I do like that it doesn't just give up because one or two hosts are misbehaving
[09:48:22] it is actually configurable - e.g. whether to abort when one host fails or to keep trying no matter what
[09:49:43] -p I think it is?
[09:51:46] I am going to take a look at both centrallog hosts; it seems rotation didn't free up space as usual, and there are some alerts ongoing
[09:57:43] Filippo beat me to it: https://gerrit.wikimedia.org/r/c/operations/puppet/+/806166
[10:00:27] yeah, deploying now
[10:02:05] I am pasting the data for context on the ticket, but leaving it to you
[10:03:08] ack
[10:07:15] I believe dbstore1003:s7 may be misconfigured as "core" (mw) - it is catching up on replication after some issue was detected
[10:08:07] but that creates the illusion of a lot of s7 write traffic on mw - which is not the case, FYI
[10:09:27] e.g.: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&viewPanel=37&from=1655352551270&to=1655374151270
[10:10:23] jynus: Anything I can do to help un-mis-configure dbstore1003:s7?
[10:10:33] not really
[10:11:02] the problem is that the configuration lives in a database, and there are not a lot of checks
[10:11:28] ideally we would have a dashboard for that, or something to manually reclassify instances
[10:12:21] (it is probably not the only mistake, so some full auditing will be needed at some point)
[10:12:28] 👍
[10:12:50] e.g. wikidata people asked us about cloud hosts and backup hosts with bad latency
[10:13:09] and it was because they were misclassified as core, too
[10:13:43] if it were up to me, I would try to drop "core" and use "mediawiki", as I believe core is a very ambiguous term
[10:14:05] analytics dbs are as important as core, they just don't serve mw production traffic
[10:15:01] that (server classification) and grant handling need a rethink, I think, but it won't be a fast fix :-(
[10:15:33] I agree. Neither is an easy task.
[10:16:37] it is complicated because there are 2 dimensions - section and "usage"
[10:19:55] plus existing tools like puppet are not great for that - it is host-level config vs mysql instance config (and instances can be, and usually are, multiplexed per host)
[11:40:13] slyngs: hey, looks like https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/16ae4489588c5caffc9b24b7351f354ea1bc7025%5E%21/#F0 made apt-get update hang on bullseye cloud vps instances
[11:41:01] taavi: I'll take a look, maybe we need to exclude it from cloud for now
[11:41:25] oh yes, we certainly need to do that
[11:42:05] thanks!
indeed, 'private' doesn't sound like something that you want installed cloud-wide
[11:44:53] slyngs: we can pass a new option to profile::apt (like $use_private_repo), set it to false in hieradata/cloud.yaml, and enable it in hieradata/common/profile/apt.yaml
[11:45:31] Perfect, I'll do a quick patch
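A minimal sketch of the opt-out being proposed here, assuming a hypothetical parameter name and a plain file-based sources entry with a placeholder URL; the real profile::apt (and the eventual patch) will differ in the details:

    # Sketch only, not the actual profile::apt code.
    # hieradata/common/profile/apt.yaml would set:  profile::apt::use_private_repo: true
    # hieradata/cloud.yaml would set:               profile::apt::use_private_repo: false
    class profile::apt (
        Boolean $use_private_repo = lookup('profile::apt::use_private_repo'),
    ) {
        if $use_private_repo {
            # Simplified stand-in for however the real module declares apt sources;
            # the URL and component are placeholders.
            file { '/etc/apt/sources.list.d/wikimedia-private.list':
                ensure  => file,
                content => "deb http://apt.example.wmnet/wikimedia bullseye-wikimedia private\n",
            }
        }
    }

This stops new cloud instances from ever writing the entry; cleaning up an entry that has already been written is a separate problem, which comes up below.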
[11:57:05] moritzm: The apt sources are 100% Puppet managed, so Puppet should automatically remove the private repo when it's no longer required?
[11:57:34] slyngs: that's correct
[11:57:43] Perfect, in that case: https://gerrit.wikimedia.org/r/c/operations/puppet/+/806197
[11:58:03] yeah, this is enabled for both WMCS and prod via profile::apt::manage_apt_source
[12:01:49] moritzm: I'll just fix the comma before doing the Puppet merge; sorry, didn't spot that
[12:03:16] taavi: Should be fixed in the next puppet run
[12:05:11] Didn't seem to break additional stuff anyway :-)
[12:08:30] nice :-)
[12:18:26] slyngs: it doesn't seem to get removed from instances where it was already added
[12:19:54] Well, that's annoying :-) Just a sec, we'll just ask puppet to remove it.
[12:26:09] oh, I mixed up the profile parameter earlier. profile::apt::purge_sources is the relevant one, and that is in fact not enabled for WMCS
[12:28:06] but to clean this up we can add an absented apt config to profile::wmcs::instance (and then eventually drop that once puppet has run everywhere)
[12:28:11] or even simpler:
[12:28:45] remove /etc/apt/sources.list.d/wikimedia-private.list via the WMCS Cumin installation
[12:40:42] Okay, otherwise this should remove it: https://gerrit.wikimedia.org/r/c/operations/puppet/+/806206
[12:55:37] taavi: Should be "more" fixed now
[12:56:20] indeed, thank you!
[12:56:30] Thank you for pointing it out :-)
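A minimal sketch of the "absented apt config" option mentioned above, assuming the class name from the conversation; the actual change in the Gerrit patch may look different. With profile::apt::purge_sources off for WMCS, an entry that has already been written has to be removed explicitly (or out-of-band via Cumin):

    # Hypothetical sketch: explicitly remove the stray sources file on cloud
    # instances; the resource can be dropped again once every instance has run Puppet.
    class profile::wmcs::instance {
        file { '/etc/apt/sources.list.d/wikimedia-private.list':
            ensure => absent,
        }
    }

The Cumin route is a one-off command with nothing to revert later, at the cost of the cleanup not being tracked in Puppet.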
[13:09:17] Reedy: can I get a little help debugging a mediawiki vs. database issue? Could be now, or I could make you a phab task. (Or you could suggest some other #channel where I should look for help, I don't really know where the MW folks lurk these days)
[13:09:33] andrewbogott: Maybe... What's up? :P
[13:09:35] * andrewbogott remembers to just ask
[13:09:55] It's not so much what's up as what's down: https://labtestwikitech.wikimedia.org/
[13:10:16] I've confirmed that I can connect reliably to the database from the host using the creds that mediawiki needs to use
[13:10:18] make sure the password gets changed :P
[13:10:36] Is 'went away' a creds issue? I would've expected something more explicit
[13:10:37] Please tell me that's not the password being leaked
[13:10:40] andrewbogott: everyone can see that
[13:10:46] marostegui: yes, it likely is
[13:11:11] shit, that page looks 100% different from when I last looked at it :(
[13:11:15] ok, so now we have two problems
[13:11:19] Jesus
[13:11:42] I am in bed with a 38 °C fever
[13:12:00] marostegui: maybe it's worth it if someone pages
[13:12:15] Please rest if you are unwell
[13:12:18] I can try to rotate but there are folks who are better qualified to do so
[13:12:30] Let me check if that is the OLD password
[13:13:01] let's move to a more appropriate channel
[13:13:05] yes
[14:28:46] rzl: a week or so ago we made our first incident report in data engineering! most of the actionables are done. btullis is wondering if there is anything else we should do with the report atm
[14:28:50] https://wikitech.wikimedia.org/wiki/Incidents/2022-05-31_Analytics_Data_Lake_-_Hadoop_Namenode_failure
[14:28:56] do you know?
[14:46:26] If you want to practice - volunteer to come to some of our postmortem sessions!
[14:47:23] ottomata: https://wikitech.wikimedia.org/wiki/Incident_review_ritual
[14:47:49] We had the cloud team recently; it could be nice, but it is not "compulsory"
[14:51:10] jynus: I'd happily volunteer on behalf of data engineering, but unfortunately that Monday timeslot is often tricky for me to make.
[14:53:25] so the strategy is to first expand it within the SRE team; hopefully the whole thing then gets expanded to technology and other teams too, if it is seen as a good idea
[14:55:22] 👍 Thanks
[14:56:49] congrats on the writing, it is already more extensive than most reports!
[15:06:39] Thank you. ottomata did the bulk of it.
[15:09:47] one thing I recently discussed with sobanski is that we as engineers should focus - especially in the heading - less on the technical details and more on the actual, human-readable impact
[15:11:05] although it is true that for analytics infrastructure your end users are in some cases engineers; still, we all often make this mistake of focusing on buzzwords and not people :-)
[15:21:56] Yes, I see what you mean. We went straight to root cause analysis and technical detail, without much description of the impact and the users affected.
[15:23:10] don't worry, I was thinking aloud about other cases, not yours
[15:36:10] godog: not sure if you are still around, but I am cross-referencing incident reports - would https://wikitech.wikimedia.org/wiki/Incidents/2022-06-14_overload_varnish_/_haproxy be better with a 2022-06-10 date in the title? (I can rename it, just checking with you - thank you for creating it!)
[15:37:28] I will be bold and just rename it - I can move it back to the original place if needed
[15:54:09] jynus: yes, you are correct! thank you for keeping an eye out
[16:43:05] Hi SREs! Can someone make a copy of deploy1002's puppet logs from the last 2 days in a place where I can read them?
[16:53:24] godog: btw, as of yesterday the incident status form has a date input as well, for easier retroactive filing without having to rename afterwards (or fiddle with the title query between form submit and saving the new page)
[16:53:35] hth :)
[16:57:18] dancy: the last 24h can be seen on puppetboard: https://puppetboard.wikimedia.org/node/deploy1002.eqiad.wmnet (left column, each row is a puppet run)
[16:58:21] I get "Service access denied due to missing privileges." when I enter my login info on the idp.wikimedia.org page.
[16:59:53] the problem with puppet logs is that they include diffs and could potentially contain secrets (ideally not, but it might happen)
[17:00:37] I don't recall which LDAP group puppetboard requires
[17:02:13] volans: cn=ops
[17:02:47] yeah, ops or sre-admins
[17:05:03] ok, so that won't work out for me.
[17:10:21] dancy: what are you looking for exactly? Maybe I can suggest alternatives
[17:12:53] I'm trying to debug the problem described in T310740. I inspected the relevant puppet configurations, and it seems like the bootstrap-scap-targets resource should be executed whenever /etc/dsh/group/scap_targets changes, but that does not seem to be the case. So I'm hoping to find warnings or error messages related to that.
[17:12:54] T310740: scap-o-scap: Bootstrapping a new host fails - https://phabricator.wikimedia.org/T310740
[17:14:05] AFAICT that is puppet failing on a target host, not on deploy1002
[17:14:34] The puppet run is failing because something that should have happened via puppet on deploy1002 didn't happen.
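For context on the mechanism being described, this is roughly what a file resource driving a refresh-only exec looks like in Puppet. The resource titles match the log excerpts further down; the class name, file content and command path are placeholders, not the real profile::mediawiki::deployment::server code:

    # Hypothetical sketch: the exec is refresh-only, so it fires only on runs
    # where Puppet actually changes the dsh group file.
    class scap_targets_sketch (
        String $targets = "host1001.example.wmnet\nhost1002.example.wmnet\n",
    ) {
        file { '/etc/dsh/group/scap_targets':
            ensure  => file,
            content => $targets,
            notify  => Exec['bootstrap-scap-targets'],
        }

        exec { 'bootstrap-scap-targets':
            command     => '/usr/local/sbin/bootstrap-scap-targets',  # placeholder path
            refreshonly => true,
        }
    }

On a run where the file content is unchanged, a refresh-only exec does nothing, so the question becomes whether the agent on deploy1002 changed the file (and fired the exec) before or after the new target's first run.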
[17:16:01] are you saying that to bootstrap scap on a target host, puppet needs to run first on the deploy host?
[17:16:18] yes
[17:16:22] as it stands.
[17:16:36] Needs improvement, but it's where we are right now.
[17:17:02] does this mean that all reimages of a host that is a scap target will fail the first puppet run?
[17:17:40] It will be a race between when puppet runs on the new target and when it runs on deploy1002.
[17:18:21] is this based on exported resources?
[17:22:41] * dancy reads about puppet exported resources.
[17:22:52] It doesn't appear so, but it looks like that might be the right approach, yes?
[17:23:00] volans: yes, it is
[17:23:36] well, actually not
[17:23:56] but it's based on puppetdb data on what's using the scap::target or mediawiki::scap classes, see profile::scap::dsh
[17:28:22] can I ask where (as in a phab task or design doc) this approach was decided? It doesn't seem to take into account the first provisioning of hosts, IMHO
[17:28:42] but, in the meanwhile, to answer your question
[17:30:22] a puppet run added ml-cache1002 to /etc/dsh/group/scap_targets at Jun 16 2022 - 11:45:11
[17:30:47] puppet runs on ml-cache1002 kept failing at 11:55:19 and 11:58:50
[17:30:53] the one at 12:03:17 was successful
[17:31:54] Do you see any evidence that the `bootstrap-scap-targets` exec resource fired after /etc/dsh/group/scap_targets was updated?
[17:32:29] Exec[bootstrap-scap-targets] success
[17:33:40] (sorry, times above are UTC+2 because the UI gives them in my local time)
[17:34:01] this is the actual log line
[17:34:01] Jun 16 09:45:47 deploy1002 puppet-agent[31807]: (/Stage[main]/Profile::Mediawiki::Deployment::Server/Exec[bootstrap-scap-targets]) Triggered 'refresh' from 1 event
[17:34:22] and the previous one was:
[17:34:23] Jun 16 09:45:11 deploy1002 puppet-agent[31807]: (/Stage[main]/Scap::Dsh/Scap::Dsh::Group[scap_targets]/File[/etc/dsh/group/scap_targets]) Scheduling refresh of Exec[bootstrap-scap-targets]
[17:34:55] also worth mentioning, dancy, that all puppet runs after the one at 9:45 UTC on deploy1002 didn't change anything
[17:35:08] on the host until UTC afternoon
[17:35:17] so way after ml-cache was working
[17:37:30] ottomata: looks great! I agree with jynus that it'd be good to add a one-sentence description of What Happened And How Much It Mattered To Users, framed for someone who doesn't know your infra -- but if you mean in terms of incident-reporting process, I think you're done, but I'm not the guy :)
[17:39:29] Krinkle: sweet! thank you <3 <3
[17:56:58] dancy: I've replied in https://phabricator.wikimedia.org/T310740#8009798 with what I have
[18:01:22] Thanks, Volans. I have an idea for making the bootstrapping of scap more reliable. I will discuss it with my team.
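A rough sketch of the PuppetDB-driven approach described above (profile::scap::dsh building the dsh group from whichever hosts declare the relevant classes). The PQL query and the file rendering are illustrative assumptions - the real profile also considers mediawiki::scap and certainly differs in detail - while puppetdb_query() is the standard function shipped with the PuppetDB terminus, and sort/unique/join come from stdlib:

    # Hypothetical sketch: derive the scap dsh group from PuppetDB. A host only
    # shows up after its first catalog has been stored and the deploy host's
    # agent has run again, which is where the race with freshly reimaged
    # targets comes from.
    $scap_classes = puppetdb_query(
        'resources[certname] { type = "Class" and title = "Scap::Target" }'
    )
    $scap_hosts = sort(unique($scap_classes.map |$resource| { $resource['certname'] }))

    file { '/etc/dsh/group/scap_targets':
        ensure  => file,
        content => join($scap_hosts, "\n"),
    }

Either way - PuppetDB query or exported resources - the group file only changes when the deploy host's agent runs, hence the timeline above.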
[18:26:23] thanks rzl! oh, who is the guy?
[18:31:56] onfire team is the guy :) but as jynus says, we're getting good with a lot of those new processes in the core SRE team before thinking too hard about expanding them out, so I'm not sure how much there is for you to do -- anyone from the onfire working group can tell you more
[18:32:19] (cc lmata to elaborate)
[18:43:05] sorry, I haven't kept up with the latest phab bug - was a deploy done, and did it work? I cannot find the ticket number now
[18:43:58] ah, I found it: https://phabricator.wikimedia.org/T310742
[18:44:04] seems resolved
[18:46:27] we are trying to decom a VM; the cookbook tells ganeti to shut it down.. then it just sits there. manually going to the ganeti server and using "gnt-instance info .." on it.. also just sits there.. no response.. sigh
[18:47:23] ah, "Could not shutdown block device disk1/" - and then it continues at some point
[18:48:10] Started forced sync of VMs ... looks like we are good now
[18:57:36] ottomata: the report looks great, I might echo jynus's recommendation to consider editing the summary in the metadata section into something less technical / more focused on user impact. That said, also thinking longer term, would it be interesting to you and your team to join us in the next ONFIRE session to chat?
[18:59:12] lmata: let's put it on the agenda to discuss at the next meeting: adding some header for that, so it happens naturally (or another way to encourage people to do it), as it is a very common thing that happens
[18:59:51] that suggestion was in the OG template as well :) > Summary of what happened, in one or two paragraphs. Avoid assuming deep knowledge of the systems here, and try to differentiate between proximate causes and root causes.
[19:00:13] > Do not assume the reader knows what your service is or who uses it.
[19:00:27] yeah, but it is not working; we should give it a spin somehow 0:-)
[19:00:42] e.g. most people may not be reading the docs :-D
[19:02:12] my conclusion a few years ago was that this wasn't a documentation problem, it was a social problem
[19:03:07] the DNS cookbook failed because:
[19:03:11] fatal: unable to access 'https://netbox1002.eqiad.wmnet/dns.git/': The requested URL returned error: 403
[19:04:46] it did update netbox2002 though
[19:04:58] sounds related to the recent upgrade
[19:09:54] cdanis: I agree :-D
[19:55:02] lmata: sure!
[19:55:15] maybe me, btullis and joal?
[20:36:32] ack