[00:58:39] I'm working on building a new plugin version (https://gerrit.wikimedia.org/r/c/operations/software/elasticsearch/plugins/+/738979), but running into the error here: https://phabricator.wikimedia.org/T295705#7512131 any guesses?
[01:00:02] It's interesting that the error message says `pool/component/elastic65/w/wmf-elasticsearch-search-plugins/wmf-elasticsearch-search-plugins_6.5.4-6~stretch_all.deb` when I'm trying to include `wmf-elasticsearch-search-plugins_6.5.4-7_amd64.changes`
[01:02:07] * legoktm looks
[01:03:34] if it isn't my favorite resident debian expert :D
[01:03:35] ryankemper: how did you build the package?
[01:04:52] for some reason the changes file includes wmf-elasticsearch-search-plugins_6.5.4-6~stretch_all.deb which is wrong, presumably it should be 6.5.4-7~something
[01:05:14] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/elasticsearch/plugins/+/refs/heads/master/debian/rules#8 ?
[01:06:03] cwhite: legoktm: ah, looks like it's as simple as bumping that
[01:06:06] that looks like it
[01:06:41] Weird that the patch version can't be inferred from the changelog :/
[01:07:32] it can, let me find the snippet
[01:22:30] https://git.dpkg.org/cgit/dpkg/dpkg.git/tree/scripts/mk/pkg-info.mk
[01:23:06] neat
[01:28:24] so you should be able to include that as `/usr/share/dpkg/pkg-info.mk`
[01:28:27] and then reference those variables
[07:47:36] dcausse: o/ anything ongoing with wdqs1006?
[07:50:13] <_joe_> elukey: yes, sadly we found out it's running blazegraph
[07:50:29] <_joe_> probably the only option is to reinstall the server in order to remove the malware
[07:51:31] okok :D
[07:52:44] elukey: I like how https://gerrit.wikimedia.org/r/c/operations/puppet/+/736753/ turned a quick script into a discussion about completely re-doing that component :D
[07:53:41] <_joe_> majavah: that's software engineering for you
[07:56:09] majavah: I didn't want to imply "you have to do it all now", but I am wondering what could be better in the medium/long term since the code is growing a lot.. Even a simple separate gerrit repo would suffice, and the initial code could be basically what you are doing now (like having a module etc..) :)
[07:57:08] I can definitely work on it or help if we decide to do it, I don't want to put all the load on you!
[07:57:27] <_joe_> moritzm: re: https://gerrit.wikimedia.org/r/c/operations/puppet/+/730836 - if you want a good correspondence between server name and primary role
[07:58:44] <_joe_> the best way is to inspect the $_role global (which I think isn't stored in puppetdb) and/or modify slightly the role() function (modules/wmflib/lib/puppet/functions/role.rb)
[07:59:19] <_joe_> we can make it declare a noop resource called 'primary::role'
[07:59:44] <_joe_> something similar to cumin::selector, if you want
[08:00:22] <_joe_> oh wait
[08:00:32] <_joe_> we already do it in cumin::target
[08:05:24] <_joe_> and in fact
[08:07:18] <_joe_> query_resources("", "Class[Profile::Cumin::Target]").map |$x| {{$x['certname'] => $x["tags"].filter |$t| { "role::" in $t}}} gives you what you want
[08:07:55] <_joe_> look how beautiful puppet is
[08:08:41] ack. but at this point role::mediawiki::common is the only special case, I'll simply track it separately
[08:14:46] <_joe_> but yeah it's not great the way we built that tag tbh
[08:15:06] <_joe_> we needed to have something like primary::role::
[14:29:40] herron: i see you merged the exim change, did everything go ok?
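For reference on the pkg-info.mk suggestion above (01:05–01:28): a minimal sketch, not the actual debian/rules of the plugins repo, of how the package version can be taken from debian/changelog instead of being hard-coded. pkg-info.mk derives its variables from dpkg-parsechangelog; the package name and version shown here are just the ones from this conversation.

```sh
# Hedged sketch: read the version the same way /usr/share/dpkg/pkg-info.mk does,
# by asking dpkg-parsechangelog, instead of hard-coding "6.5.4-6" anywhere.
version="$(dpkg-parsechangelog --show-field Version)"   # e.g. "6.5.4-7"
echo "wmf-elasticsearch-search-plugins ${version}"

# In a debian/rules makefile the equivalent is roughly:
#   include /usr/share/dpkg/pkg-info.mk
# after which $(DEB_VERSION) holds the changelog version.
```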
[14:30:38] jbond: it updated the template that would be deployed when we switch the ldap enabled hiera back, but it was effectively a noop so far
[14:30:52] ahh ok of course thanks :)
[14:31:26] herron: moritzm: are you free to try the change again now
[14:32:15] https://gerrit.wikimedia.org/r/c/operations/puppet/+/739641
[14:32:21] jbond: but fwiw the ldap hiera is disabled on deployment-mx03.deployment-prep.eqiad1.wikimedia.cloud and the change deployed there if you'd like to have a look
[14:32:35] jbond: sure
[14:32:51] sure, let's do that
[14:33:04] ack merging now
[14:33:47] disabled puppet on mx1001
[14:35:10] herron: should be a noop on 1001
[14:35:18] change deployed to 2001
[14:35:24] yeah just out of paranoia
[14:40:19] test mail to my wikimedia.org address routed via mx2001 went fine
[14:40:43] opened a new test ticket via mx2001 as well, checking with Alex if it arrived
[14:40:54] test mail to my own account worked fine but i see that mail destined for otrs-test@wikimedia.org is going via gsuite
[14:42:00] ah indeed, just got the bounce
[14:42:31] yes the otrs router is being missed https://phabricator.wikimedia.org/P17773
[14:49:50] I think we have the wrong format in /etc/exim4/otrs_emails for the lookup as configured. after changing the line to e.g. 'otrs-test@wikimedia.org: foo' the otrs router is used
[14:51:26] herron: ack thanks I'll revert for now and send a fix
[14:51:39] ok sounds good
[14:51:57] yeah, indeed, https://www.exim.org/exim-html-current/doc/html/spec_html/ch-file_and_database_lookups.html has the spec which needs a colon
[14:52:04] +1 on reverting
[14:52:17] ahh great thanks
[15:03:51] fyi _joe_ if you haven't already, check out puppetdb_query using pql. it's, imo, much nicer than the puppetdbquery functions https://github.com/wikimedia/puppet/blob/production/modules/wmflib/functions/role_hosts.pp#L26-L30
[15:06:38] for some definition of nicer™ ;)
[15:07:42] :)
[15:16:03] herron: moritzm: i have applied a fix and regenerated the aliases file. want to try again with https://gerrit.wikimedia.org/r/c/operations/puppet/+/739826 or would you prefer to wait?
[15:22:25] in fact i have 90 mins of meetings starting in 10 mins so let's leave it until Monday (there is no urgency on this)
[15:22:52] ack, sounds good. let's give this another shot on Monday
[15:23:34] +1 monday works
[15:41:36] SRE: note from -ops: 15:39 < XioNoX> !log lvs2007:~$ sudo service pybal stop - T295118
[15:41:37] T295118: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118
[15:42:06] while this work is ongoing, there's no safe way to provision any changes to pybal config for services (because we lack the redundancy to do the rolling restarts the normal way)
[15:42:15] so please hold off on any pybal-affecting service changes in codfw
[15:49:05] FYI, I'm going to shut down the switch, servers listed on https://phabricator.wikimedia.org/T295118#7509294 will be unreachable for a bit
[15:56:09] XioNoX: *crosses fingers*
[16:04:59] Swift latency up a bit
[16:18:06] looks like the new switch might be faulty as well, we're looking into it and will use spare #2 if needed (but it will take a bit more time)
[16:42:45] ugh, hardware. It's almost as bad as software :)
[17:01:50] XioNoX: would you care to hazard a guess as to how long the downtime is likely to continue, please?
[17:03:32] Emperor: the new switch got upgraded, and is being racked so it should be no more than 30min if everything goes well
[17:08:49] Thanks.
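To illustrate the otrs_emails fix discussed between 14:40 and 14:52: exim's lsearch lookup files (see the spec linked at 14:51:57) expect one `key: data` entry per line, so a bare list of addresses without the colon-separated data never matches and the otrs router is skipped. A sketch, assuming the file path from the log; the "otrs" value on the right-hand side is a placeholder, not the production router's real data:

```sh
# Hedged sketch of the lsearch key/value format herron describes; every line
# needs the "key: data" shape for the lookup to match.
cat /etc/exim4/otrs_emails
# otrs-test@wikimedia.org: otrs
# info@wikimedia.org: otrs

# Address routing can be checked without sending mail; -bt prints which
# router/transport would handle the address (Debian names the binary exim4).
sudo exim4 -bt otrs-test@wikimedia.org
```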
[17:33:26] Emperor: servers are coming up
[17:34:37] Thanks, I see the thanos backend is back
[17:35:28] likewise the ms backends
[17:35:53] hopefully swift latency will drop again soon
[17:37:24] then I can eat :)
[17:38:02] [latency still up, but I'm guessing that's some backfilling now the nodes are up again]
[18:08:47] latency back down again :)
[18:51:11] Emperor and all, we're going to reboot the same ToR switch as it's not letting IPv6 go through (causing the previous outage)
[19:08:04] current status about codfw: the switch reboot didn't help, and it's still discarding IPv6 traffic
[19:08:29] hooray 🙃
[19:08:43] codfw is still depooled, and better to keep it depooled until row B is healthy
[19:35:21] ETA ~2h before being able to replace it with a different spare
[20:16:30] arnoldokoth: I merged your patch 'site: include new k8s hosts on kubestage group'
[20:16:34] (partly by accident)
[20:21:09] andrewbogott: Hehe. No worries.
[21:48:53] one random appserver (mw1448) - alert: DNS CRITICAL - expected '0.0.0.0' but got '10.65.1.26' .. wut? why would it expect 0.0.0.0 for anything in mgmt
[21:49:12] and the server seems random, not in SAL or a hardware ticket or anything
[21:49:28] since 2d 1h
[21:49:45] it's the DNS check
[21:53:02] oh:
[21:53:03] [puppetmaster1001:~] $ host 10.65.1.26
[21:53:03] 26.1.65.10.in-addr.arpa domain name pointer mw1448.mgmt.eqiad.wmnet.
[21:53:03] 26.1.65.10.in-addr.arpa domain name pointer wmf5023.mgmt.eqiad.wmnet.
[21:53:57] no, that's also ok. netbox and DNS look consistent. dunno why icinga thinks that
[21:56:15] mutante: where does it get the "expected" part from?
[21:58:22] XioNoX: might be netbox
[22:01:20] ok, so the one we call "DNS" in the Icinga UI is using the check_fqdn command.. looking
[22:03:01] but check_fqdn is actually check_dns
[22:03:09] check_dns -H $ARG1$ -a $HOSTADDRESS$
[22:03:28] http://nagios-plugins.org/tag/check_dns/
[22:06:51] "Optional IP-ADDRESS you expect the DNS server to return." but /usr/lib/nagios/plugins/check_dns is a binary
[22:08:27] https://github.com/nagios-plugins/nagios-plugins/blob/master/plugins/check_dns.c "/* compare to expected address */"
[22:09:16] no idea why it gets 0.0.0.0 in this one case though
[22:13:16] in the actual icinga config it is a line like any other. check_command check_fqdn!mw1448.mgmt.eqiad.wmnet . so it does not seem to be a thing that results in a different icinga command for this among everything else.. odd.. but also just one single mgmt host
[22:13:49] moving on from it for now, let's see if it sticks around
[22:15:20] well, one more thing as a bonus. I can also run the same command myself and it's fine:
[22:15:26] [alert1001:/etc/icinga/objects] $ /usr/lib/nagios/plugins/check_dns mw1448.mgmt.eqiad.wmnet
[22:15:29] DNS OK: 0.017 seconds response time. mw1448.mgmt.eqiad.wmnet returns 10.65.1.26|time=0.016746s;;;0.000000
[22:15:33] shrug
[22:16:35] mutante: https://phabricator.wikimedia.org/T293610
[22:16:54] That's a previous occurrence of the error and it was a hw fault
[22:17:30] RhinosF1: that is a great find. thank you. though I don't understand how the cable could be it, heh
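A note on the check_fqdn alert mutante is chasing: the command definition quoted at 22:03:09 passes Icinga's host address as the expected answer to check_dns, so a host object whose address is 0.0.0.0 yields exactly the "expected '0.0.0.0' but got '10.65.1.26'" message even though DNS itself is fine. A sketch of reproducing both behaviours by hand, using the plugin path from the log:

```sh
# Without -a the plugin only checks that the name resolves; this is the
# invocation run by hand at 22:15:26 and it passes.
/usr/lib/nagios/plugins/check_dns -H mw1448.mgmt.eqiad.wmnet

# With -a, check_dns also compares the answer against an expected address,
# which is what check_fqdn does via "-a $HOSTADDRESS$". If the Icinga host
# address is 0.0.0.0, this fails the same way as the alert.
/usr/lib/nagios/plugins/check_dns -H mw1448.mgmt.eqiad.wmnet -a 0.0.0.0
```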
[22:17:59] mutante: no idea how but I knew I'd heard the error before
[22:18:18] that was good, it did seem vaguely familiar, yes
[22:18:57] My memory is full of random facts about old tickets I've been nosey about and read
[22:19:14] ;)
[22:20:15] I guess it'll need a ticket for ops-eqiad to take a look
[22:20:57] yea, i'll create one soonish.. if it doesn't go away
[22:21:00] mutante:
[22:21:07] define host {
[22:21:07] address 0.0.0.0
[22:21:07] check_command check_ping!500,20%!2000,100%
[22:21:07] check_period 24x7
[22:21:07] contact_groups admins,admins
[22:21:07] host_name mw1448.mgmt
[22:21:07] hostgroups mgmt
[22:21:08] max_check_attempts 2
[22:21:08] notification_interval 0
[22:21:09] notification_options d,u,r,f
[22:21:09] notification_period 24x7
[22:21:10] notifications_enabled 1
[22:21:10] parents msw1-eqiad
[22:21:11] }
[22:21:20] er, sorry I wanted to send a snippet
[22:21:21] oh, lol?
[22:21:32] see the IP "0.0.0.0"
[22:21:38] why did I not see that, I was in there just now :)
[22:21:39] yea
[22:22:04] XioNoX: where does it get that data from though
[22:22:08] race condition between netbox, puppetdb and icinga somewhere?
[22:22:23] @alert1001:/etc/icinga/objects$ less puppet_hosts.cfg
[22:22:31] how did the other ticket get fixed by replacing a cable? :)
[22:22:38] maybe you were in the puppet_services.cfg
[22:22:48] that's right, yea
[22:23:10] I am tempted to manually throw that out of the config and refresh icinga
[22:23:18] then run puppet and see if it adds it back or not
[22:23:30] Might wake it up
[22:24:07] maybe some stale cache too
[22:24:44] let me delete that host and check if the icinga config is still happy
[22:25:17] well, or rather.. manually fix the IP of it
[22:25:25] and then run puppet
[22:27:04] * RhinosF1 off to lie in bed and stare and maybe fall asleep
[22:27:19] edited the file, running icinga config check (OK), running puppet
[22:28:19] !log icinga (alert1001) - manually fix IP of mw1448.mgmt (was 0.0.0.0 is: 10.65.1.26) in /etc/icinga/objects/puppet_hosts.cfg , running puppet
[22:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:28:55] it adds it back as 0.0.0.0
[22:29:04] so then puppetdb
[22:29:10] or netbox
[22:49:39] mutante: it's a custom fact generated by mw1448. `sudo /usr/sbin/bmc-config -o -S Lan_Conf` on mw1448 renders the 0.0.0.0
[22:55:27] cwhite: oooh, thank you! I guess it starts to make more sense how a broken cable can result in this
[22:56:10] or a broken DRAC
[23:16:53] I updated https://docs.google.com/document/d/1s56_keYG8J58nZjH5tJLsiLeWDVPTRe43NRakCuMFmA/edit# with the latest on the codfw row B issue. So far the root cause is unknown and the impact is mitigated
[23:29:23] Thanks XioNoX. Given there's not much coordination needed at this point, I'm going to step down as IC. Anyone can pick it up again if need be.
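Following up on cwhite's finding at 22:49:39: a sketch of how one might confirm on the host itself that the BMC (which the custom fact reads) is the thing reporting 0.0.0.0. The bmc-config invocation is the one quoted in the log; the grep is only a convenience and the exact key names can vary by FreeIPMI version.

```sh
# Dump the BMC LAN configuration section that the custom fact is based on;
# a healthy DRAC/BMC should show the real mgmt address here, not 0.0.0.0.
sudo /usr/sbin/bmc-config -o -S Lan_Conf | grep -iE 'ip_address|subnet_mask'
```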