[03:23:20] cwhite: This is perfect. I found a lot of what I need in that page:)
[08:00:18] looks to be OK now...?
[08:02:10] <_joe_> Emperor: yes, and a netsplit hardly gets resolved by restarting a bot either
[08:08:20] _joe_: I don't know much about logmsgbot (other than what systemctl service to eyeball on icinga.wm.org now ;) ), but I've encountered irc robots that got confused by things like netsplits in the past :)
[08:09:13] <_joe_> Emperor: yeah but in the specific case, it just ended in the half of the netsplit where we weren't
[08:22:52] It was off for half an hour from what I can see
[08:23:17] Restarting just causes them to hopefully pick another server in the rotation
[08:23:36] which stops it being split if it doesn't notice itself
[09:01:59] https://www.mediawiki.org/wiki/Wikimedia_Performance_Team/Web_Perf_Hero_award#Amir_Sarabadani
[09:02:00] congratulations Amir1
[09:02:22] * Amir1 blushes
[09:02:45] congrats :)
[09:04:42] nice work Amir1 :)
[09:06:21] <3
[09:16:23] very nice!
[09:37:08] <_joe_> Amir1: given the work you're doing, I expect you to win performance MVP for at least the next couple years
[09:37:21] <_joe_> anything less would be very disappointing tbh
[09:37:23] <_joe_> :P
[09:37:25] haha
[09:37:27] <_joe_> (congrats)
[09:37:32] xD
[09:37:42] <_joe_> no pressure heh
[09:46:14] go Amir1 :)
[09:47:29] woot woot, well done Amir1 !
[09:52:34] congrats!
[09:56:53] Thanks <3
[10:44:02] congrats Amir1
[10:46:25] Thanks ^^
[11:04:48] jbond: do you have any fixes for the spdx rake tasks throwing 'NoMethodError: undefined method `escape' for URI:Module'? that's with ruby 3.0 as packaged in debian testing
[11:07:44] taavi: unfortunately not, puppet is not compatible with ruby 3.0 until we get to puppet 7
[11:08:02] i think the other users on sid have created a container (cc moritzm )
[11:08:39] hmm although we could probably monkey patch it in let me try
[11:09:38] i'll play with monkey patching later today, that may create a hacky way forward.
[11:09:45] volans: when you have a minute, can you review https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/786960 please?
[11:09:54] * jbond needs to first create a sid machine
[11:09:59] yeah, exactly I'm simply using it in a bullseye systemd-nspawn container for now
[11:10:46] dcaro: sure, sorry if it was blocked on me, I might have missed the latest PS
[11:12:41] np :)
[11:12:54] kudos Amir1!
[11:15:25] dcaro: {done}
[11:15:58] thanks!
[11:17:08] * volans going afk for lunch, no need to wait for me ;)
[11:27:25] Hi SRE, I am trying to connect to WMF stat machines (bast5002), and running into an SSH error: "channel 2: open failed: connect failed: Name or service not known" I tried connecting directly to the bastion, changing networks, other bastions, and clearing DNS cache; none of them worked. Can anyone help with this? Thanks in advance.
[11:27:56] Are you using bast5002.wikimedia.org ?
[11:28:09] yes
[11:28:38] works for me :/, can you pass the ssh command you are using?
[11:29:00] ssh -N stat1005.eqiad.wmnet -L 8880:127.0.0.:8880
[11:30:13] you are missing a '1' in the -L option (127.0.0.*1*)
[11:30:25] (not sure if that's the issue though xd)
[11:31:10] ah, thank you!:) that is the issue!
[11:31:35] \o/ (typos are 90% of my issues too :) )
[11:33:51] typos!
[12:02:14] hmmm cp3050 is having some issues with NRPE checks
[12:02:30] vgutierrez: looking i just pushed a change
[12:03:00] same for db1131 and es2032
[12:04:53] vgutierrez: reverted
[12:06:06] hmmm I wonder what went wrong with it
[12:06:28] taavi: not sure but for a lot of checks the command was an empty string
[12:06:57] jbond: that fixed it indeed, thx
[12:07:37] although a second run on cp3050 brought in the correct string so may be a bit more complicated than that. Will dig into it a bit more this afternoon
[12:33:39] dcaro: Thanks!
[12:44:31] Amir1: congrats <3
[12:47:48] cdanis: <3
[16:54:22] Forgive me if this is already handled but has someone looked into the deluge of swift-recon-cron mails? If not I'd be interested in looking into it :)
[16:57:25] brett: I haven't looked at them, but I know ms-be1* hosts are under maintenance for upgrade/hw issues
[16:58:39] #wikimedia-data-persistence could be a way to ask further
[16:59:47] I also remember yesterday there was some alerting with thanos-swift
[17:00:23] brett: also see ms-be warnings in Icinga (maybe). : Exec[mkfs-/dev/sdc1],Exec[mkfs-/dev/sdd1]. https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=detail&servicestatustypes=4&hoststatustypes=3&serviceprops=42
[17:01:08] failed to make the file system?
[17:01:42] As far as I know, puppet sometimes accidentally wipes entire partitions on reimage, hence the noise
[17:02:19] (for media server backends)
[17:02:47] O.o
[17:03:02] so I usually let media storage owners handle it and not add more pressure 0:-)
[17:04:04] brett: don't worry, I am the person in charge of media storage recovery process and I am not worried, there is enough redundancy AND backup redundancy :-D
[17:04:29] What if puppet gets hungrier and starts eating those too? :D
[17:05:00] (>@.@)> 1 0 1 0 1 0
[17:05:39] not too worried because it would take > a week to delete all the files
[17:06:02] but we have a plan for a plan C on AWS (offsite offline backups)
[17:21:38] (I wasn't serious, just making stupid jokes)
[17:27:16] actually, I think it is a good point
[17:27:51] that is why we have a plan for it- have something that wouldn't be affected in the case of a root compromise
[19:02:13] Is there planned work happening in eqiad?
A bunch of cloud* servers just stopped responding
[19:03:17] 19:02 <+icinga-wm> PROBLEM - BGP status on cloudsw1-d5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Connect - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:03:21] 19:03 <+icinga-wm> RECOVERY - Host cloudgw1002 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms
[19:03:27] andrewbogott: back maybe? ^
[19:03:40] Maybe? Would like to know what happened...
[19:04:00] I don't think things are back up, I can't reach cloudvirt1029.eqiad.wmnet for example
[19:05:28] topranks: ^ a BGP alert..then this report.. and now BGP alert recovered
[19:06:12] andrewbogott: v6 still seems affected
[19:09:03] mutante: thanks yes I am looking at it cheers
[19:09:07] it seems sort of like things are recovering now but I'm not 100% convinced. Going to keep my prior appointment but please ping me if there's anything I can help diagnose.
[19:09:09] thank you topranks !
[19:10:27] andrewbogott: i just got on cloudvirt1029 again
[19:10:30] a second ago
[19:10:44] v6 still broken for me
[19:10:52] the ASN in the error message.. was ..us
[19:11:27] (because the runbook says to check the link on peeringdb. so that would be https://www.peeringdb.com/asn/14907)
[19:12:26] "if warning/yellow .. tag/ping netops" -- done
[19:16:18] It's an issue with RAs from the switch being blocked
[19:16:23] give me a moment
[19:18:39] ok should have things starting to come back on v6 now
[19:24:50] mutante: thanks for checking! as we start to use BGP more internally we should probably update that documentation, it's correct for internet peering but peeringdb wouldn't come into play here, as the BGP session was between our own switches and core routers.
[19:25:32] topranks: I was kind of thinking that probably your response would be "of course it is US" in this case.
it seemed like internal..but yea :)
[19:25:42] just followed the link
[20:01:42] topranks: I'm back from my appointment now (briefly) -- everything looks cleared up, are there any lessons to learn or action items?
[20:05:10] andrewbogott: Glad to hear things seem cleared up.
[20:06:12] Lesson for me is to move slower and do more checks as you go. I made a change that broke the IPv6 router-advertisements, but as it takes 10 mins for that to take effect everything looked ok right after and I moved on and did the other side.
[20:09:51] In terms of action items I don't think there are any right now. The problem related to the migration we are doing, not steps we ever do normally.
[20:10:33] I'll think it over though and see if there is anything we might try to do.
[20:50:03] If anyone is bored, I could use prod root help with generating an x509 cert for the k8s cluster -- https://phabricator.wikimedia.org/T297140#7958287
[21:24:35] bd808 still need help? I haven't touched the k8s but I do have the access and I have used cergen
[21:25:12] inflatador: I do still need help, and you can't make it worse than the current non-functional state. :)
[21:26:09] sounds good, I'm pairing w ryankemper if you feel like jumping in meet.google.com/caw-qzat-kan