[00:35:21] Legoktm: thanks! That patch looks promising. I will try it out tomorrow.
[00:36:39] woot :)
[05:36:07] volans: I was decommissioning a codfw host and it failed on the netbox step: https://phabricator.wikimedia.org/P17197 do you want me to create a task for this? The host has been set to decommissioning on netbox but the IP isn't removed, should I remove it manually from there?
[05:37:14] That's the mgmt IP btw
[05:39:58] From the netbox log I can see it actually deleted the IP
[05:40:14] Deleted IP Address 10.192.0.104/22
[05:40:14] sre_bot - 2021-09-03 05:21 and that's pc2007's IP
[06:47:52] marostegui: no need to touch netbox
[06:48:04] at first look there is wrong data there though
[06:48:17] error: CNAME 'pc1-master.eqiad.wmnet.' points to known same-zone NXDOMAIN 'pc2007.codfw.wmnet.'
[06:48:54] I'm not at my laptop, I can have a look in ~1h
[06:49:08] volans: no rush at all
[06:49:10] if the data is correct in netbox a run of the sre.dns
[06:49:33] sre.dns.netbox cookbook will fix it
[06:49:57] Let me see
[06:49:58] Thanks!
[06:50:41] I have a patchset to actually replace pc2007.codfw.wmnet as that was the pc1-master (and that is wrong)
[06:50:44] I will investigate, thank you!
[07:12:52] marostegui: the problem is eqiad vs codfw
[07:12:57] afaict from mobile
[07:13:13] see my line above
[07:13:56] volans: there was something definitely wrong, which was that pc1-master was pointing to pc2007.codfw.wmnet
[07:14:00] and pc2007 was being decommissioned
[08:29:03] marostegui: I'm fully online now if I can help
[08:29:03] marostegui: argh. thanks for finding/fixing that
[08:29:09] *still
[08:34:08] volans: I think we are good, thanks!
[08:34:13] kormat: no worries!!
[08:36:26] marostegui: volans: was the problem related to https://wikitech.wikimedia.org/w/index.php?title=MariaDB%2FDecommissioning_a_DB_Host#[Only_codfw_hosts,_until_they_are_migrated_to_Netbox]_Remove_DNS_production_entries?
[08:36:50] And is that point even still true?
[08:37:49] no, that hasn't been true for a long time
[08:38:36] Good to know, let me fix that page
[10:46:00] I'm trying to upload a file to phab, and keep getting the error "via cp3062.esams.wmnet, ATS/8.0.8, 502, Next Hop Connection Failed" is there something going on? Should I just keep trying?
[10:57:49] dcaro: upload went through fine for me just now
[10:57:57] Is it a large file?
[10:58:53] 892K, a dell support hardware logs dump
[10:59:25] oh, got an error just trying to create a task :/ (408, inactive timeout)
[10:59:47] might be my connection, but feels weird
[10:59:56] Definitely weird
[11:00:38] 220KB was the test file I did
[11:00:58] ok, now creating the task worked right away, let me retry the file upload
[11:01:06] I doubt it's your connection if you get the wikimedia error page
[11:01:18] that was my thought too
[11:02:46] still getting the next hop connection failed error
[11:02:51] dcaro: you might find more in the logs
[11:04:05] I have never touched any phabricator/non-cloud systems, so I'll need some guidance :), where can I find the logs?
[11:04:37] dcaro: probably logstash
[11:04:48] Doesn't the error give a request id?
[11:05:39] no, just says "request from - via cp3056.esams.wmnet, ATS/8.0.8 Error: 502, Next Hop Connection Failed at 2021-09-03 11:01:24 GMT"
[11:06:05] after the "Error, our servers are currently under maintenance ... see the error message at the bottom..."
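(Aside: one way to cross-check an upload like the one above outside the web form is Phabricator's Conduit file.upload endpoint driven with curl. This is only a sketch: it assumes a valid Conduit API token, and the file names are made up.)

# encode the file for file.upload (base64, no line wrapping)
$ base64 -w0 dell-hwlogs.tar.gz > dell-hwlogs.b64
# POST it to the Conduit API; replace the token placeholder with a real one
$ curl -s https://phabricator.wikimedia.org/api/file.upload \
      --data-urlencode "api.token=api-XXXXXXXX" \
      --data-urlencode "name=dell-hwlogs.tar.gz" \
      --data-urlencode "data_base64@dell-hwlogs.b64"

If the API call succeeds while the web form keeps returning the 502, that at least narrows the problem down to the form upload path.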
[11:07:10] * RhinosF1 waits for someone with access as he doesn't know what other search parameters exist
[11:10:09] I'll open a task then
[11:10:15] (so I don't forget xd)
[11:15:11] for future reference I created (T290321), and it let me upload a screenshot :)
[11:15:11] T290321: Phabricator next hop when trying to upload a file - https://phabricator.wikimedia.org/T290321
[11:23:48] * sobanski heading out to a doctor's appointment, will be unavailable for 2h or so.
[14:03:30] if a host is marked on netbox as failed
[14:03:35] and I want to decommission it
[14:03:58] what should I do? I ran the decomm cookbook but it failed :/
[14:04:43] I also have an alert "Domain mc1026.mgmt.eqiad.wmnet was not found by the server" which I am not sure how to fix either
[14:08:21] effie: you can try to run it again but use --force
[14:09:02] first I thought it was --no-verify but that is for the reimage
[14:09:29] mutante: the decommission cookbook?
[14:09:47] effie: yes
[14:09:58] effie: though.. I see the status in netbox is now already "decom"
[14:11:03] force is if you want to decomm more than 5 servers at once
[14:11:34] oh, right, I mixed that up with --no-verify on wmf-reimage
[14:11:36] mc1027 is marked as failed
[14:11:46] ok, I looked at mc1026
[14:11:49] about the DNS alert for mc1026.mgmt.eqiad.wmnet
[14:11:54] I have no idea
[14:12:47] hmm. best I have then is to save the relevant logs on a pastebin or ticket and report to v.olans
[14:13:08] sounds like it was only partially removed
[14:13:26] yeah there was a hiccup with 1026
[14:13:33] but 1027 died months ago
[14:13:38] thank you!
[14:13:40] I will file a task
[14:13:59] maybe dcops knows more about how to handle the Failed state
[14:14:43] just noticed there is an alert for mc1026 as well
[14:14:57] (being down)
[14:15:15] yea, this would make sense if it was not removed from puppetdb
[14:15:19] but was still shut down
[14:15:53] i'd run the downtime cookbook "manually" for now to silence it over the weekend
[14:16:15] or.. we need to go back to manually removing it from puppetdb like we did before the cookbook
[14:16:47] if the failure was limited to setting the icinga downtime.. then it's a known race condition
[14:16:53] if there was more to it then not sure
[14:18:32] [cumin1001:~] $ sudo cookbook sre.hosts.downtime -r "a race?" -H 128 mc1026.eqiad.wmnet
[14:18:37] or so
[14:20:10] the host is decommed, there should be no icinga alert for it
[14:21:12] yea, but it's a thing that happens sometimes.. it does the other decom steps but the icinga downtime fails. and it will only disappear from icinga once it is removed from puppetdb and puppet has run on alert*
[14:21:32] I see
[14:21:37] you can try to run puppet on alert* to see if it removes it from the icinga config or not
[14:22:03] and if it's still in there you can set the downtime for a couple of days
[14:22:54] or find the old docs on how to kill hosts from puppetdb (what the decom script does, but do it yourself) but probably better to find out via the ticket
[14:22:56] hi all, just a heads up: I plan to disable puppet at 15:00 to kick off the puppetdb maintenance work
[14:25:16] ok I will let things be for the time being
[14:25:18] file a task
[14:25:28] and deal with it all on monday
[14:26:07] sounds good, have a good weekend
[14:26:19] effie: also.. Monday is a major US holiday, probably no meeting
[14:26:22] and no US people
[14:27:39] thanks for letting me know
[14:38:34] what's up wrt decom?
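(Aside on the step discussed above: the chat notes that a stale Icinga entry only disappears once the host is gone from PuppetDB and Puppet has run on the alert hosts. Forcing that run from a cumin host looks roughly like the sketch below; the 'alert*' selector is illustrative and may differ.)

# trigger a puppet run on the alert hosts so the Icinga config is regenerated from PuppetDB
[cumin1001:~] $ sudo cumin 'alert*' 'run-puppet-agent'

If the decommissioned host is still in the generated config after that, the downtime cookbook shown above is the stop-gap until the PuppetDB side is sorted out.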
[15:18:14] volans: I didn't want to bother you about it on a Friday evening
[15:18:16] https://phabricator.wikimedia.org/T290326
[15:18:23] so I made a task for Monday!
[15:19:43] I'll have a look, thx
[15:53:58] fyi all: the puppetdb maintenance finished much quicker than expected, so I have re-enabled puppet
[15:54:34] \o/ thanks jbond !
[16:02:30] quick, time to deploy things :-D
[16:05:16] and blame j.bond if it doesn't work ;)
[16:05:42] Hold on there now kids.... it hasn't stopped being Friday :D
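(Aside on the maintenance window mentioned above: disabling and re-enabling Puppet fleet-wide is typically done from a cumin host roughly as sketched below; the host selector, batch size, and reason string are illustrative.)

# disable puppet everywhere with a reason before the maintenance...
[cumin1001:~] $ sudo cumin -b 30 '*' 'disable-puppet "puppetdb maintenance"'
# ...and re-enable it afterwards; enable-puppet expects the same reason string it was disabled with
[cumin1001:~] $ sudo cumin -b 30 '*' 'enable-puppet "puppetdb maintenance"'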