[04:35:33] <marostegui>	 Reminder that I am half day off today
[04:35:33] <marostegui>	 Reminder that I am half day off today
[08:00:54] <kormat>	 marostegui: ditto :)
[08:00:54] <kormat>	 marostegui: ditto :)
[08:04:09] * Emperor rolls a tumbleweed
[08:04:09] * Emperor rolls a tumbleweed
[08:14:22] <kormat>	 grumble. i can understand _why_ debian/copyright doesn't support specifying "copyright unknown for these files", but it's very inconvenient :(
[08:14:22] <kormat>	 grumble. i can understand _why_ debian/copyright doesn't support specifying "copyright unknown for these files", but it's very inconvenient :(
[08:19:31] * Emperor twitches
[08:19:31] * Emperor twitches
[08:20:19] <kormat>	 Emperor: lintian outputs a healthy 1789 lines of `file-without-copyright-information`
[08:20:19] <kormat>	 Emperor: lintian outputs a healthy 1789 lines of `file-without-copyright-information`
[08:20:59] <Emperor>	 that sounds like most of the source tree, which is surely not right since presumably we have some reason to think we can use this code?
[08:20:59] <Emperor>	 that sounds like most of the source tree, which is surely not right since presumably we have some reason to think we can use this code?
[08:21:34] <kormat>	 i'm loving how lintian does not output the full path, too
[08:21:34] <kormat>	 i'm loving how lintian does not output the full path, too
[08:22:05] <kormat>	 Emperor: the top-level go 'vendor' dir is excluded by the existing debian/copyright
[08:22:05] <kormat>	 Emperor: the top-level go 'vendor' dir is excluded by the existing debian/copyright
[08:22:10] <kormat>	 it contains many files
[08:22:10] <kormat>	 it contains many files
[08:22:21] <Emperor>	 😱
[08:22:21] <Emperor>	 😱
[08:22:52] <kormat>	 there's ~33 different projects in there
[08:22:52] <kormat>	 there's ~33 different projects in there
[08:22:56] <Emperor>	 😱
[08:22:56] <Emperor>	 😱
[08:23:06] <Emperor>	 this is orchestrator, right?
[08:23:06] <Emperor>	 this is orchestrator, right?
[08:23:12] <kormat>	 i have strong feelings about needing to investigate and configure the copyright stuff for all of them
[08:23:12] <kormat>	 i have strong feelings about needing to investigate and configure the copyright stuff for all of them
[08:23:13] <kormat>	 yep
[08:23:13] <kormat>	 yep
[08:23:36] <kormat>	 https://github.com/openark/orchestrator/tree/master/vendor
[08:23:36] <kormat>	 https://github.com/openark/orchestrator/tree/master/vendor
[08:25:43] <Emperor>	 so presumably upstream are implicitly claiming "everything in vendor/ that we grabbed from elsewhere is compatible with our apache-2 license, honest"?
[08:25:43] <Emperor>	 so presumably upstream are implicitly claiming "everything in vendor/ that we grabbed from elsewhere is compatible with our apache-2 license, honest"?
[08:26:17] <kormat>	 Emperor: that would be how i'd read the situation too, yeah
[08:26:17] <kormat>	 Emperor: that would be how i'd read the situation too, yeah
[08:26:37] <Emperor>	 [in fairness, the first one I looked at has a MIT license]
[08:26:37] <Emperor>	 [in fairness, the first one I looked at has a MIT license]
[08:27:29] <Emperor>	 If we were wanting to get this package in Debian, that mess would need fixing. But we don't, I assume, so I think lintian overrides for the copyright fail is probably the way to go
[08:27:29] <Emperor>	 If we were wanting to get this package in Debian, that mess would need fixing. But we don't, I assume, so I think lintian overrides for the copyright fail is probably the way to go
[08:27:45] <kormat>	 we definitely don't re: get this into debian
[08:27:45] <kormat>	 we definitely don't re: get this into debian
[08:27:52] <kormat>	 ohh. tell me more :)
[08:27:52] <kormat>	 ohh. tell me more :)
[08:28:42] <Emperor>	 you can override lintian warnings to make them go away ; dh_lintian (1) installs the override files for you
[08:28:42] <Emperor>	 you can override lintian warnings to make them go away ; dh_lintian (1) installs the override files for you
[08:29:52] <kormat>	 https://www.debian.org/doc/manuals/maint-guide/dother.en.html#lintian points to https://lintian.debian.org/manual/index.html, which says "Sorry, page not found"
[08:29:52] <kormat>	 https://www.debian.org/doc/manuals/maint-guide/dother.en.html#lintian points to https://lintian.debian.org/manual/index.html, which says "Sorry, page not found"
[08:30:32] <Emperor>	 Let me find the thing you want to read
[08:30:32] <Emperor>	 Let me find the thing you want to read
[08:30:37] <kormat>	 thanks :)
[08:30:37] <kormat>	 thanks :)
[08:31:46] <Emperor>	 If you want HTML and have lintian installed, it's file:///usr/share/doc/lintian/lintian.html/index.html
[08:31:46] <Emperor>	 If you want HTML and have lintian installed, it's file:///usr/share/doc/lintian/lintian.html/index.html
[08:32:01] <Emperor>	 (probably actually just section 2.4 which is on overrides)
[08:32:01] <Emperor>	 (probably actually just section 2.4 which is on overrides)
[08:32:43] <Emperor>	 you're already using debhelper for the package build, so dh_lintian should be run automatically for you (and its manual tells you where to put the overrides)
[08:32:43] <Emperor>	 you're already using debhelper for the package build, so dh_lintian should be run automatically for you (and its manual tells you where to put the overrides)
[08:35:19] <kormat>	 🤞 running now
[08:35:19] <kormat>	 🤞 running now
[08:35:34] <Emperor>	 cool
[08:35:34] <Emperor>	 cool
[08:36:00] * Emperor is reading the incident reporting stuff having got the email about ONFIRE; bit surprised this isn't in the onboarding library-of-things-to-read
[08:36:00] * Emperor is reading the incident reporting stuff having got the email about ONFIRE; bit surprised this isn't in the onboarding library-of-things-to-read
[09:02:49] <kormat>	 `E: orchestrator source: source-is-missing resources/public/js/cluster-pools.js line length is 340 characters (>256)`
[09:02:49] <kormat>	 `E: orchestrator source: source-is-missing resources/public/js/cluster-pools.js line length is 340 characters (>256)`
[09:02:56] <kormat>	 👀
[09:02:56] <kormat>	 👀
[09:03:41] <kormat>	 despite having `source-is-missing resources/public/js/*.js` in the overrides
[09:03:41] <kormat>	 despite having `source-is-missing resources/public/js/*.js` in the overrides
[09:04:35] <volans>	 is it supposed to be a bash globbing or regex?
[09:04:35] <volans>	 is it supposed to be a bash globbing or regex?
[09:06:30] <Emperor>	 "If you add an asterisk (*) in the additional info, this will match arbitrary strings similar to the shell wildcard. "
[09:06:31] <Emperor>	 "If you add an asterisk (*) in the additional info, this will match arbitrary strings similar to the shell wildcard. "
[09:07:36] <volans>	 *similar* is the key word I guess :-P
[09:07:37] <volans>	 *similar* is the key word I guess :-P
[09:08:42] <godog>	 hi folks, I'll be kicking off another rebalance of swift eqiad, this time with more weight to the new hosts
[09:08:43] <godog>	 hi folks, I'll be kicking off another rebalance of swift eqiad, this time with more weight to the new hosts
[09:15:47] <jynus>	 ok, thanks
[09:15:48] <jynus>	 ok, thanks
[09:18:45] <Emperor>	 (I wrestle lintian overrides infrequently enough that I'm starting roughly anew each time, and it's often frustrating, sorry :-/ )
[09:18:45] <Emperor>	 (I wrestle lintian overrides infrequently enough that I'm starting roughly anew each time, and it's often frustrating, sorry :-/ )
[09:58:27] <kormat>	 the glob _does_ work
[09:58:27] <kormat>	 the glob _does_ work
[09:58:38] <kormat>	 if i remove that line, other files in that dir produce errors
[09:58:38] <kormat>	 if i remove that line, other files in that dir produce errors
[09:59:12] <kormat>	 the ones which don't get suppressed are the ones with the line length addition
[09:59:12] <kormat>	 the ones which don't get suppressed are the ones with the line length addition
[11:38:41] <marostegui>	 I am going to warm up eqiad again and get ready for tomorrow
[11:38:41] <marostegui>	 I am going to warm up eqiad again and get ready for tomorrow
[12:10:03] <kormat>	 > Please note, that very-long-line-length-in-source-file tagged files are likely tagged source-is-missing. It is a feature not a bug.
[12:10:03] <kormat>	 > Please note, that very-long-line-length-in-source-file tagged files are likely tagged source-is-missing. It is a feature not a bug.
[12:10:06] <kormat>	 thanks, lintian :/
[12:10:06] <kormat>	 thanks, lintian :/
[12:15:07] <Emperor>	 ugh
[12:15:07] <Emperor>	 ugh
[12:15:19] <kormat>	 Emperor: as a bonus, adding _that_ tag also doesn't fix it
[12:15:19] <kormat>	 Emperor: as a bonus, adding _that_ tag also doesn't fix it
[12:20:31] <Emperor>	 :(
[12:20:31] <Emperor>	 :(
[12:21:45] <kormat>	 on a different error, it's not possible to use lintian-override to get ride of bad-distribution-in-changes-file
[12:21:45] <kormat>	 on a different error, it's not possible to use lintian-override to get ride of bad-distribution-in-changes-file
[12:21:56] <kormat>	 because the .changes file does not appear in the source tree
[12:21:56] <kormat>	 because the .changes file does not appear in the source tree
[12:24:50] <kormat>	 wow. ok. so `source-is-missing resources/public/js/*.js`, which matches the affected files, doesn't work
[12:24:51] <kormat>	 wow. ok. so `source-is-missing resources/public/js/*.js`, which matches the affected files, doesn't work
[12:24:58] <kormat>	 but `source-is-missing *` _does_ work
[12:24:58] <kormat>	 but `source-is-missing *` _does_ work
[12:27:08] <kormat>	 N: 1797 tags overridden (9 errors, 1788 warnings)
[12:27:08] <kormat>	 N: 1797 tags overridden (9 errors, 1788 warnings)
[12:28:04] <Emperor>	 \o/
[12:28:04] <Emperor>	 \o/
[12:30:33] <Emperor>	 https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=575400 has a CLI work-around for the -changes thing
[12:30:33] <Emperor>	 https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=575400 has a CLI work-around for the -changes thing
[12:31:16] <kormat>	 best part of that: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=575400#42
[12:31:16] <kormat>	 best part of that: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=575400#42
[12:31:33] <kormat>	 Emperor: so there's a way to supply a flag on the cmdline. but what i'm doing is `debuild -d -us -uc`
[12:31:33] <kormat>	 Emperor: so there's a way to supply a flag on the cmdline. but what i'm doing is `debuild -d -us -uc`
[12:31:34] * Emperor has just dutifully reported that as spam
[12:31:34] * Emperor has just dutifully reported that as spam
[12:31:48] <kormat>	 i do not wish to have to add whatever awful incantations are required to pass that flag to lintian
[12:31:48] <kormat>	 i do not wish to have to add whatever awful incantations are required to pass that flag to lintian
[12:32:01] <kormat>	 if there was some way to do it in debian/rules, however, i'm all ears.
[12:32:01] <kormat>	 if there was some way to do it in debian/rules, however, i'm all ears.
[12:32:33] <Emperor>	 I have no idea what your objection to typing "--lintian-opts --suppress-tags bad-distribution-in-changes-file" is ;-)
[12:32:33] <Emperor>	 I have no idea what your objection to typing "--lintian-opts --suppress-tags bad-distribution-in-changes-file" is ;-)
[12:32:57] <kormat>	 🥀
[12:32:57] <kormat>	 🥀
[12:33:41] <Emperor>	 that glyph is quite hard to parse [though I am beginning to recognise it as WILTED FLOWER]
[12:33:41] <Emperor>	 that glyph is quite hard to parse [though I am beginning to recognise it as WILTED FLOWER]
[12:33:54] <kormat>	 it looks _very_ pretty on irccloud
[12:33:54] <kormat>	 it looks _very_ pretty on irccloud
[12:34:14] <kormat>	 https://usercontent.irccloud-cdn.com/file/6dBlTYVB/image.png
[12:34:14] <kormat>	 https://usercontent.irccloud-cdn.com/file/6dBlTYVB/image.png
[12:36:04] <kormat>	 Emperor: is there a way to provide lintian options to dh_lintian?
[12:36:04] <kormat>	 Emperor: is there a way to provide lintian options to dh_lintian?
[12:36:11] <Emperor>	 https://www.chiark.greenend.org.uk/~matthewv/qsj/wilted_flower.png
[12:36:11] <Emperor>	 https://www.chiark.greenend.org.uk/~matthewv/qsj/wilted_flower.png
[12:36:20] <kormat>	 manpage just says it takes "debhelper_options", which is totes helpful
[12:36:20] <kormat>	 manpage just says it takes "debhelper_options", which is totes helpful
[12:36:33] <kormat>	 Emperor: you have a chiark account? 😮
[12:36:33] <kormat>	 Emperor: you have a chiark account? 😮
[12:37:37] <Emperor>	 is everyone's favourite orbital _that_ infamous?
[12:37:37] <Emperor>	 is everyone's favourite orbital _that_ infamous?
[12:38:00] <kormat>	 hi, am nerd. :)
[12:38:00] <kormat>	 hi, am nerd. :)
[12:41:19] <Emperor>	 I'm afraid that the answer to your question re dh_lintian is probably "no", though
[12:41:19] <Emperor>	 I'm afraid that the answer to your question re dh_lintian is probably "no", though
[12:41:51] <Emperor>	 I mean, you could set that override in your lintian conf, but that might be more invasive than you wanted
[12:41:52] <Emperor>	 I mean, you could set that override in your lintian conf, but that might be more invasive than you wanted
[12:42:18] <kormat>	 ah. i'm off-base. dh_lintian doesn't _run_ lintian. it just installs the override files
[12:42:18] <kormat>	 ah. i'm off-base. dh_lintian doesn't _run_ lintian. it just installs the override files
[12:42:39] <kormat>	 debuild runs lintian. so... nevermind!
[12:42:39] <kormat>	 debuild runs lintian. so... nevermind!
[12:50:09] <Emperor>	 If we were building in CI, it'd be worth making the CI have the relevant lintian config
[12:50:09] <Emperor>	 If we were building in CI, it'd be worth making the CI have the relevant lintian config
[13:56:02] <Amir1>	 jynus: hi, is this s4? https://phabricator.wikimedia.org/T275268#7275022
[13:56:03] <Amir1>	 jynus: hi, is this s4? https://phabricator.wikimedia.org/T275268#7275022
[13:56:10] <Amir1>	 That seems a bit too big for s4
[13:56:11] <Amir1>	 That seems a bit too big for s4
[13:56:30] <Amir1>	 hmm, actually with *links tables there, no, it makes sense
[13:56:30] <Amir1>	 hmm, actually with *links tables there, no, it makes sense
[13:56:53] <jynus>	 note not all has to be the immediate optimization
[13:56:53] <jynus>	 note not all has to be the immediate optimization
[13:57:25] <jynus>	 there can be other older fragmentations, even if no rows are deleted, there is always some decrease
[13:57:25] <jynus>	 there can be other older fragmentations, even if no rows are deleted, there is always some decrease
[13:58:15] <Amir1>	 yeah
[13:58:15] <Amir1>	 yeah
[13:58:20] <Amir1>	 I wonder how https://phabricator.wikimedia.org/T289249#7304088 looks like now
[13:58:20] <Amir1>	 I wonder how https://phabricator.wikimedia.org/T289249#7304088 looks like now
[13:58:37] <jynus>	 comparing now eqiad vs codfw
[13:58:37] <jynus>	 comparing now eqiad vs codfw
[13:59:03] <jynus>	 1624897103716 vs 1927575031078
[13:59:03] <jynus>	 1624897103716 vs 1927575031078
[13:59:07] <jynus>	 in bytes
[13:59:07] <jynus>	 in bytes
[14:02:23] <Amir1>	 300GB redaction. Cool
[14:02:23] <Amir1>	 300GB redaction. Cool
[14:03:16] <Amir1>	 I will get djvu thing handled soon, that'd be 40GB ish too
[14:03:16] <Amir1>	 I will get djvu thing handled soon, that'd be 40GB ish too
[14:31:38] <jynus>	 thanks
[14:31:38] <jynus>	 thanks
[14:32:15] <jynus>	 I will be waiting for the switchover windows to pass to do non priority maintenance
[14:32:15] <jynus>	 I will be waiting for the switchover windows to pass to do non priority maintenance
[15:02:23] <Emperor>	 ms-be2045 is odd - no sign of a shutdown, but no sign of anything obviously untorward
[15:02:23] <Emperor>	 ms-be2045 is odd - no sign of a shutdown, but no sign of anything obviously untorward
[15:02:49] <Emperor>	 I'm guessing we don't have netconsole or similar for these systems?
[15:02:49] <Emperor>	 I'm guessing we don't have netconsole or similar for these systems?
[15:03:10] <godog>	 yeah that's correct no netconsole :|
[15:03:10] <godog>	 yeah that's correct no netconsole :|
[15:03:46] <godog>	 and only two HDDs detected, looks like the host and/or controller threw their toys out of the pram
[15:03:46] <godog>	 and only two HDDs detected, looks like the host and/or controller threw their toys out of the pram
[15:04:09] <Emperor>	 do we have iDRAC or what-have-you to check if the BIOS saw anything odd?
[15:04:09] <Emperor>	 do we have iDRAC or what-have-you to check if the BIOS saw anything odd?
[15:04:28] <moritzm>	 there's errors in SEL, though:
[15:04:28] <moritzm>	 there's errors in SEL, though:
[15:04:56] <godog>	 Emperor: yes
[15:04:56] <godog>	 Emperor: yes
[15:04:58] <moritzm>	 Record:      2
[15:04:58] <moritzm>	 Record:      2
[15:05:00] <moritzm>	 Date/Time:   09/13/2021 15:47:31
[15:05:00] <moritzm>	 Date/Time:   09/13/2021 15:47:31
[15:05:01] <moritzm>	 Source:      system
[15:05:01] <moritzm>	 Source:      system
[15:05:03] <moritzm>	 Severity:    Critical
[15:05:03] <moritzm>	 Severity:    Critical
[15:05:04] <moritzm>	 Description: A fatal error was detected on a component at bus 59 device 0 function 0.
[15:05:04] <moritzm>	 Description: A fatal error was detected on a component at bus 59 device 0 function 0.
[15:05:07] <moritzm>	 and shortly after: 
[15:05:07] <moritzm>	 and shortly after: 
[15:05:07] <godog>	 I have a meeting which I'm already late for, bbiab
[15:05:07] <godog>	 I have a meeting which I'm already late for, bbiab
[15:05:13] <moritzm>	 Description: A fatal error was detected on a component at bus 58 device 2 function 0.
[15:05:13] <moritzm>	 Description: A fatal error was detected on a component at bus 58 device 2 function 0.
[15:05:24] <godog>	 feel free to do anything you feel necessary
[15:05:24] <godog>	 feel free to do anything you feel necessary
[15:06:06] <moritzm>	 Emperor: you can connect to the serial console via root@ms-be2045.mgmt.codfw.wmnet
[15:06:06] <moritzm>	 Emperor: you can connect to the serial console via root@ms-be2045.mgmt.codfw.wmnet
[15:06:21] <moritzm>	 and then "racadm getsel" since it's one of our Dells
[15:06:21] <moritzm>	 and then "racadm getsel" since it's one of our Dells
[15:06:27] <Emperor>	 TY
[15:06:27] <Emperor>	 TY
[15:06:37] <moritzm>	 password is in pwstore under "management"
[15:06:37] <moritzm>	 password is in pwstore under "management"
[15:06:58] <moritzm>	 I'm currently not sure how to map bus 58 device 2 to anything, though
[15:06:58] <moritzm>	 I'm currently not sure how to map bus 58 device 2 to anything, though
[15:07:38] <moritzm>	 but the server is under warranty for two more months
[15:07:38] <moritzm>	 but the server is under warranty for two more months
[15:08:05] <moritzm>	 so it's best to open a task in Phabricator, tag it with "ops-codfw" and have Papaul open a support case
[15:08:05] <moritzm>	 so it's best to open a task in Phabricator, tag it with "ops-codfw" and have Papaul open a support case
[15:10:30] <Emperor>	 At a guess 59 is 0x3b
[15:10:31] <Emperor>	 At a guess 59 is 0x3b
[15:10:35] <Emperor>	 3b:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108 [Invader] (rev 02)
[15:10:35] <Emperor>	 3b:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108 [Invader] (rev 02)
[15:10:35] <Emperor>	  
[15:10:35] <Emperor>	  
[15:10:38] <Emperor>	 from lspci
[15:10:38] <Emperor>	 from lspci
[15:11:23] <Emperor>	 though bus 58 device 2 is less obvious but maybe 3a:02.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port 1C (rev 04)
[15:11:23] <Emperor>	 though bus 58 device 2 is less obvious but maybe 3a:02.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port 1C (rev 04)
[15:11:24] <Emperor>	  
[15:11:24] <Emperor>	  
[15:11:51] <Emperor>	 At the risk of being very low-tech, I'm tempted to try turning it off and on again and see if the drives come back better?
[15:11:51] <Emperor>	 At the risk of being very low-tech, I'm tempted to try turning it off and on again and see if the drives come back better?
[15:12:56] <moritzm>	 +1 on that, low tech beats high tech often enough :-)
[15:12:56] <moritzm>	 +1 on that, low tech beats high tech often enough :-)
[15:14:39] <Emperor>	 I should presumably !log that? 
[15:14:40] <Emperor>	 I should presumably !log that? 
[15:14:46] <Emperor>	 (sorry, still quite new here!)
[15:14:46] <Emperor>	 (sorry, still quite new here!)
[15:18:26] <moritzm>	 yes, best to use !log for a reboot
[15:18:26] <moritzm>	 yes, best to use !log for a reboot
[15:19:39] <moritzm>	 let's day the reboot fixes it and goes down again in a few weeks, then having a trail in server access log makes it easier to map this to a pre-existing problem
[15:19:39] <moritzm>	 let's day the reboot fixes it and goes down again in a few weeks, then having a trail in server access log makes it easier to map this to a pre-existing problem
[15:19:49] <Emperor>	 I've made https://phabricator.wikimedia.org/T290881 for now
[15:19:49] <Emperor>	 I've made https://phabricator.wikimedia.org/T290881 for now
[15:21:08] <Emperor>	 ...if it isn't fixed by reboot, I'll reassign it to ops-codfw / papaul
[15:21:08] <Emperor>	 ...if it isn't fixed by reboot, I'll reassign it to ops-codfw / papaul
[15:22:37] <Emperor>	 (is it straightforward to watch the console as it boots? I'm guessing not since that's almost always a faff...)
[15:22:37] <Emperor>	 (is it straightforward to watch the console as it boots? I'm guessing not since that's almost always a faff...)
[15:24:58] <Emperor>	 The HTML5 console on newer Supermicro kit didn't totally suck...
[15:24:58] <Emperor>	 The HTML5 console on newer Supermicro kit didn't totally suck...
[15:26:51] <Emperor>	 ah, console com2
[15:26:51] <Emperor>	 ah, console com2
[15:27:07] <Emperor>	 ...only just shutting down now, systemd was waiting for something :-/
[15:27:07] <Emperor>	 ...only just shutting down now, systemd was waiting for something :-/
[15:28:38] <marostegui>	 classic 
[15:28:38] <marostegui>	 classic 
[15:29:19] <Emperor>	 systemd still faffing
[15:29:20] <Emperor>	 systemd still faffing
[15:29:49] * Emperor tries to resist the lure of the virtual power button
[15:29:49] * Emperor tries to resist the lure of the virtual power button
[15:30:13] <Emperor>	 "systemd-journald[1175]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected"
[15:30:13] <Emperor>	 "systemd-journald[1175]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected"
[15:30:29] <Emperor>	 shut up and reboot already, I don't care about the WATCHDOG
[15:30:29] <Emperor>	 shut up and reboot already, I don't care about the WATCHDOG
[15:31:53] <moritzm>	 you can follow the console output after you've initiated the reboot via "console com2" on the mgmt interface, so that you also get BIOS output etc.
[15:31:53] <moritzm>	 you can follow the console output after you've initiated the reboot via "console com2" on the mgmt interface, so that you also get BIOS output etc.
[15:32:15] <godog>	 back, thank you Emperor for taking over and moritzm for the assistance!
[15:32:15] <godog>	 back, thank you Emperor for taking over and moritzm for the assistance!
[15:32:43] <Emperor>	 godog: hi, I've made a phab task. Waiting for systemd to let the system reboot, it seems a bit wedged, so may be needing power-cycle shortly
[15:32:43] <Emperor>	 godog: hi, I've made a phab task. Waiting for systemd to let the system reboot, it seems a bit wedged, so may be needing power-cycle shortly
[15:33:28] <godog>	 oof :( phab task for papaul SGTM
[15:33:28] <godog>	 oof :( phab task for papaul SGTM
[15:34:29] <godog>	 possibly related, we've observed disks enumerated in the unexpected order (regardless of controller wedged or not), a reboot "fixes" it, in case it happens
[15:34:29] <godog>	 possibly related, we've observed disks enumerated in the unexpected order (regardless of controller wedged or not), a reboot "fixes" it, in case it happens
[15:35:00] <godog>	 we're using filesystem labels so it isn't a disaster when it happens though
[15:35:00] <godog>	 we're using filesystem labels so it isn't a disaster when it happens though
[15:35:41] <Emperor>	 reboot isn't going, going to power-cycle
[15:35:41] <Emperor>	 reboot isn't going, going to power-cycle
[15:36:00] <godog>	 ack
[15:36:00] <godog>	 ack
[15:36:38] <Emperor>	 if (as seems likely) this turns out to be a hardware thing, I'll reassign the phab item to papaul :)
[15:36:38] <Emperor>	 if (as seems likely) this turns out to be a hardware thing, I'll reassign the phab item to papaul :)
[15:37:28] <Emperor>	 powering up now
[15:37:28] <Emperor>	 powering up now
[15:39:40] <Emperor>	 and booting
[15:39:40] <Emperor>	 and booting
[15:39:59] <Emperor>	 wow, that's a lot of unhappy XFS errors
[15:39:59] <Emperor>	 wow, that's a lot of unhappy XFS errors
[15:41:06] <godog>	 sad_trombone.wav
[15:41:06] <godog>	 sad_trombone.wav
[15:41:19] <Emperor>	 godog: it's back up now, looks like most of the drives are there 
[15:41:19] <Emperor>	 godog: it's back up now, looks like most of the drives are there 
[15:42:09] <godog>	 yeah totally
[15:42:09] <godog>	 yeah totally
[15:42:11] <Emperor>	 mvernon@ms-be2045:~$ sudo dmesg | grep -c 'Shutting down filesystem'
[15:42:12] <Emperor>	 mvernon@ms-be2045:~$ sudo dmesg | grep -c 'Shutting down filesystem'
[15:42:12] <Emperor>	 10
[15:42:12] <Emperor>	 10
[15:43:06] <godog>	 lol :(
[15:43:07] <godog>	 lol :(
[15:43:48] <Emperor>	 So I dunno if you fancy trying to repair those, ask for h/w support, ...?
[15:43:48] <Emperor>	 So I dunno if you fancy trying to repair those, ask for h/w support, ...?
[15:44:40] <godog>	 good question, I'm tempted to nuke the filesystems and let swift rebuild them, and ask h/w support to double check we're good
[15:44:40] <godog>	 good question, I'm tempted to nuke the filesystems and let swift rebuild them, and ask h/w support to double check we're good
[15:44:47] <Emperor>	 I think at least sdc1 sde1 sdf1 sdg1 sdh1 sdi1 sdj1 sdm1 sda3 are unhappy
[15:44:47] <Emperor>	 I think at least sdc1 sde1 sdf1 sdg1 sdh1 sdi1 sdj1 sdm1 sda3 are unhappy
[15:45:07] <godog>	 from scrollback we're cutting it close to warranty anyways, might as well
[15:45:08] <godog>	 from scrollback we're cutting it close to warranty anyways, might as well
[15:45:49] <godog>	 yeah so at some point swift-drive-audit will also come around and unmount the faulty fs, we can preempt that of course
[15:45:49] <godog>	 yeah so at some point swift-drive-audit will also come around and unmount the faulty fs, we can preempt that of course
[15:46:46] <Emperor>	 godog: do you want to do the fs-nuking, and I'll reassign the phab task and ask papaul to get the h/w errors checked?
[15:46:46] <Emperor>	 godog: do you want to do the fs-nuking, and I'll reassign the phab task and ask papaul to get the h/w errors checked?
[15:47:16] <godog>	 Emperor: SGTM! I'll do the nuking here verbosely too 
[15:47:16] <godog>	 Emperor: SGTM! I'll do the nuking here verbosely too 
[15:50:01] <godog>	 so basically first I kicked off the drive-audit cron with /usr/bin/swift-drive-audit /etc/swift/swift-drive-audit.conf, to see what it'll make of the situation
[15:50:01] <godog>	 so basically first I kicked off the drive-audit cron with /usr/bin/swift-drive-audit /etc/swift/swift-drive-audit.conf, to see what it'll make of the situation
[15:51:09] * Emperor imagines sounds of weeping
[15:51:09] * Emperor imagines sounds of weeping
[15:51:42] <godog>	 heheh yeah that's weeping alright
[15:51:42] <godog>	 heheh yeah that's weeping alright
[15:51:43] <godog>	 root      10379  0.0  0.0  23620  1172 pts/1    D+   15:48   0:00  |                       \_ umount -fl /srv/swift-storage/sdc1
[15:51:43] <godog>	 root      10379  0.0  0.0  23620  1172 pts/1    D+   15:48   0:00  |                       \_ umount -fl /srv/swift-storage/sdc1
[15:53:23] <Emperor>	 (I@m still on the console, and it has a pile of hung task warnings
[15:53:23] <Emperor>	 (I@m still on the console, and it has a pile of hung task warnings
[15:54:31] <godog>	 yeah :( I'm thinking about whether the nuking is even worth it 
[15:54:31] <godog>	 yeah :( I'm thinking about whether the nuking is even worth it 
[15:55:18] <godog>	 I'll start with sdc anyways, see what happens
[15:55:18] <godog>	 I'll start with sdc anyways, see what happens
[15:56:11] <godog>	 actually no, I've changed my mind again :) swift-drive-audit unmounted basically all filesystems anyways, might as well wait for the h/w troubleshoot
[15:56:12] <godog>	 actually no, I've changed my mind again :) swift-drive-audit unmounted basically all filesystems anyways, might as well wait for the h/w troubleshoot
[15:56:39] <Emperor>	 OK
[15:56:39] <Emperor>	 OK
[15:56:54] <Emperor>	 looks like sdb3 sdb4 still mounted and that's it
[15:56:54] <Emperor>	 looks like sdb3 sdb4 still mounted and that's it
[15:57:45] <Emperor>	 godog: are you happy with leaving those mounted and thus presumably in-service?
[15:57:45] <Emperor>	 godog: are you happy with leaving those mounted and thus presumably in-service?
[15:58:34] <godog>	 yeah that's fine I think Emperor 
[15:58:34] <godog>	 yeah that's fine I think Emperor 
[16:02:15] <godog>	 I'll be OOO Sept 15th -> 23rd included, the happy path in my mind is h/w troubleshooting is successful and filesystems/disks come back, though we'd still need to nuke them I suspect
[16:02:15] <godog>	 I'll be OOO Sept 15th -> 23rd included, the happy path in my mind is h/w troubleshooting is successful and filesystems/disks come back, though we'd still need to nuke them I suspect
[16:05:37] <Emperor>	 do we need to do anything in the mean time?
[16:05:37] <Emperor>	 do we need to do anything in the mean time?
[16:07:50] <Emperor>	 [and, I guess, relatedly, shall I leave this host 'til you're back from OOO in any case?]
[16:07:50] <Emperor>	 [and, I guess, relatedly, shall I leave this host 'til you're back from OOO in any case?]
[16:09:25] <godog>	 yeah good question, so strictly speaking until we know more about the hw we don't have to do anything, I'm thinking if turning off the host would be better anyways (and nuke the fs first) from swift's pov so it doesn't even try
[16:09:25] <godog>	 yeah good question, so strictly speaking until we know more about the hw we don't have to do anything, I'm thinking if turning off the host would be better anyways (and nuke the fs first) from swift's pov so it doesn't even try
[16:09:54] <volans>	 so for the last minute intrusion... but if we just got 2 live disks, wouldn't be safer to depool it so it's ready for any kind of hardware test?
[16:09:54] <volans>	 so for the last minute intrusion... but if we just got 2 live disks, wouldn't be safer to depool it so it's ready for any kind of hardware test?
[16:10:04] <volans>	 btw FYI papaul is out AFAIK f
[16:10:04] <volans>	 btw FYI papaul is out AFAIK f
[16:10:34] <godog>	 gah, that's true! thank you for pointing it out volans 
[16:10:34] <godog>	 gah, that's true! thank you for pointing it out volans 
[16:10:37] <volans>	 lol, godog I hit enter before getting your last message :D
[16:10:37] <volans>	 lol, godog I hit enter before getting your last message :D
[16:10:54] <godog>	 haha!
[16:10:54] <godog>	 haha!
[16:12:33] <volans>	 but yeah if has to stay for a bit like that I agree better to have it out of the swift active pool, if it can stay on it will not be nuked by puppetdb ;)
[16:12:33] <volans>	 but yeah if has to stay for a bit like that I agree better to have it out of the swift active pool, if it can stay on it will not be nuked by puppetdb ;)
[16:12:38] <volans>	 *from
[16:12:38] <volans>	 *from
[16:13:28] <godog>	 mmhh yeah, likely safer to nuke the filesystems + poweroff the host and remove it from swift ring too
[16:13:28] <godog>	 mmhh yeah, likely safer to nuke the filesystems + poweroff the host and remove it from swift ring too
[16:14:02] <godog>	 since I'll be OOO in two days I'm summoning cdanis which has graciously helped with swift in the past 
[16:14:02] <godog>	 since I'll be OOO in two days I'm summoning cdanis which has graciously helped with swift in the past 
[16:14:35] <godog>	 Emperor: for some swift introductions (ha ha), I don't think you have met cdanis yet ?
[16:14:35] <godog>	 Emperor: for some swift introductions (ha ha), I don't think you have met cdanis yet ?
[16:16:24] <cdanis>	 hi! nice to meet you, I'm just back from leave :)
[16:16:24] <cdanis>	 hi! nice to meet you, I'm just back from leave :)
[16:18:49] <Emperor>	 hi :) 
[16:18:50] <Emperor>	 hi :) 
[16:18:57] * Emperor is due to go and cook dinner fairly shortly
[16:18:57] * Emperor is due to go and cook dinner fairly shortly
[16:19:24] <godog>	 ack, yeah we can resume tomorrow no problem, I have to go in not too long too
[16:19:24] <godog>	 ack, yeah we can resume tomorrow no problem, I have to go in not too long too
[16:20:25] <Emperor>	 if it's more convenient, I'm happy for you to continue and read scroll 
[16:20:26] <Emperor>	 if it's more convenient, I'm happy for you to continue and read scroll 
[16:21:58] <godog>	 sure I'll blabb^Wthink out loud here for a little while longer!
[16:21:58] <godog>	 sure I'll blabb^Wthink out loud here for a little while longer!
[16:22:21] <Emperor>	 👍
[16:22:21] <Emperor>	 👍
[16:30:09] <godog>	 basically yeah I think we're stable for now, tomorrow EU morning we can resume, but so far I'm thinking about nuking everything and powering down
[16:30:09] <godog>	 basically yeah I think we're stable for now, tomorrow EU morning we can resume, but so far I'm thinking about nuking everything and powering down
[16:38:09] <godog>	 also for the record, I stand corrected re: netconsole, we do have some servers with netconsole enabled but not fleet wide - cfr https://phabricator.wikimedia.org/T242579 
[16:38:09] <godog>	 also for the record, I stand corrected re: netconsole, we do have some servers with netconsole enabled but not fleet wide - cfr https://phabricator.wikimedia.org/T242579 
[16:48:14] <godog>	 ok gotta go! 
[16:48:14] <godog>	 ok gotta go! 
[16:48:50] <Emperor>	 〜
[16:48:50] <Emperor>	 〜
[17:25:15] <razzi>	 Hi all, looking in to https://phabricator.wikimedia.org/T290841 - "dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error" - thanks for creating the ticket
[17:25:15] <razzi>	 Hi all, looking in to https://phabricator.wikimedia.org/T290841 - "dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error" - thanks for creating the ticket
[17:25:36] <razzi>	 I'm curious about restarting mysqld to see if that improves the situation at all, would that be a good step?
[17:25:37] <razzi>	 I'm curious about restarting mysqld to see if that improves the situation at all, would that be a good step?
[17:27:22] <jynus>	 razzi, that is the expected step- short of researching the root cause :-)
[17:27:22] <jynus>	 razzi, that is the expected step- short of researching the root cause :-)
[17:28:24] <razzi>	 sg jynus, I haven't done this on this particular host, I'm guessing a systemctl restart would do the trick?
[17:28:24] <razzi>	 sg jynus, I haven't done this on this particular host, I'm guessing a systemctl restart would do the trick?
[17:28:45] <jynus>	 razzi, the main issue is we don't know user affection
[17:28:45] <jynus>	 razzi, the main issue is we don't know user affection
[17:28:59] <jynus>	 you should know how it will impact users, warn them, etc.
[17:28:59] <jynus>	 you should know how it will impact users, warn them, etc.
[17:29:16] <jynus>	 and then disable alerts that would fire on icinga
[17:29:16] <jynus>	 and then disable alerts that would fire on icinga
[17:29:20] <razzi>	 gotcha gotcha
[17:29:20] <razzi>	 gotcha gotcha
[17:30:29] <jynus>	 e.g. maybe you can do it at any time, maybe you have a maintenance procedures, that is kind-of the "difficult thing" that your team will know
[17:30:29] <jynus>	 e.g. maybe you can do it at any time, maybe you have a maintenance procedures, that is kind-of the "difficult thing" that your team will know
[17:30:45] <jynus>	 but yeah, in the end it will be just systemd
[17:30:45] <jynus>	 but yeah, in the end it will be just systemd
[17:30:48] <razzi>	 My thinking is that for these kind of hosts that do asynchronous queries, a query erroring out isn't going to cause anybody more than a few minutes of inconvenience, but I'll keep thinking about it for a bit
[17:30:48] <razzi>	 My thinking is that for these kind of hosts that do asynchronous queries, a query erroring out isn't going to cause anybody more than a few minutes of inconvenience, but I'll keep thinking about it for a bit
[17:31:12] <razzi>	 Definitely something unusual is going on, haven't known these hosts to have memory pressure
[17:31:12] <razzi>	 Definitely something unusual is going on, haven't known these hosts to have memory pressure
[17:31:32] <jynus>	 I think we have a documented procedure for mw dbs
[17:31:32] <jynus>	 I think we have a documented procedure for mw dbs
[17:31:36] <jynus>	 I can show you that
[17:31:36] <jynus>	 I can show you that
[17:31:48] <jynus>	 but not all steps may apply to analytics/DE
[17:31:48] <jynus>	 but not all steps may apply to analytics/DE
[17:31:54] <jynus>	 let me find it
[17:31:54] <jynus>	 let me find it
[17:32:19] <razzi>	 apologies if I'm totally off base, but since this is analytics and doesn't serve application traffic it should be even less risky
[17:32:19] <razzi>	 apologies if I'm totally off base, but since this is analytics and doesn't serve application traffic it should be even less risky
[17:32:29] <jynus>	 yeah, I agree
[17:32:29] <jynus>	 yeah, I agree
[17:32:38] <jynus>	 was looking it just if it helps for guidance
[17:32:38] <jynus>	 was looking it just if it helps for guidance
[17:32:53] <jynus>	 e.g. alert downtime, etc. as we have a complete checklist already predone
[17:32:53] <jynus>	 e.g. alert downtime, etc. as we have a complete checklist already predone
[17:33:12] <razzi>	 cool cool
[17:33:12] <razzi>	 cool cool
[17:34:00] <jynus>	 https://wikitech.wikimedia.org/wiki/MariaDB/Start_and_stop
[17:34:00] <jynus>	 https://wikitech.wikimedia.org/wiki/MariaDB/Start_and_stop
[17:35:11] <razzi>	 ty
[17:35:11] <razzi>	 ty
[17:49:39] <razzi>	 Ok, I'm going to proceed with restarting, will post the command here before I run them just to be safe
[17:49:39] <razzi>	 Ok, I'm going to proceed with restarting, will post the command here before I run them just to be safe
[17:50:41] <jynus>	 did you downtime the host in advance so it doesn't page all SREs?
[17:50:41] <jynus>	 did you downtime the host in advance so it doesn't page all SREs?
[17:50:48] <razzi>	 I'll be downtiming, yeah
[17:50:48] <razzi>	 I'll be downtiming, yeah
[17:51:17] <jynus>	 that's almost more important than the restart :-P
[17:51:17] <jynus>	 that's almost more important than the restart :-P
[17:51:22] <razzi>	 haha yeah
[17:51:22] <razzi>	 haha yeah
[17:51:32] <razzi>	 Random / off topic question, looking at https://noc.wikimedia.org/db.php, I don't see s3, what's up with that?
[17:51:32] <razzi>	 Random / off topic question, looking at https://noc.wikimedia.org/db.php, I don't see s3, what's up with that?
[17:51:46] <razzi>	 (Am I missing something obvious?)
[17:51:46] <razzi>	 (Am I missing something obvious?)
[17:51:54] <jynus>	 s3 technically is called DEFAULT on mediawiki config
[17:51:54] <jynus>	 s3 technically is called DEFAULT on mediawiki config
[17:52:03] <razzi>	 gotcha
[17:52:03] <razzi>	 gotcha
[17:52:19] <jynus>	 but it is known everywhere as s3, even on mw dblists
[17:52:20] <jynus>	 but it is known everywhere as s3, even on mw dblists
[17:53:19] <jynus>	 e.g.: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s3.dblist
[17:53:19] <jynus>	 e.g.: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s3.dblist
[17:53:26] <jynus>	 I can see it is confusing
[17:53:26] <jynus>	 I can see it is confusing
[17:53:40] <razzi>	 cool cool
[17:53:41] <razzi>	 cool cool
[17:54:22] <razzi>	 Going to downtime the host with a `sudo icinga-downtime -h dbstore1007 -d 3600 -r "reboot mariadb in an attempt to free up memory"`
[17:54:22] <razzi>	 Going to downtime the host with a `sudo icinga-downtime -h dbstore1007 -d 3600 -r "reboot mariadb in an attempt to free up memory"`
[17:55:14] <jynus>	 looks good to me :-)
[17:55:14] <jynus>	 looks good to me :-)
[17:56:03] <jynus>	 1-2 hours is a good default, even if it shouldn't take more than 5 minutes or so
[17:56:03] <jynus>	 1-2 hours is a good default, even if it shouldn't take more than 5 minutes or so
[17:56:34] <razzi>	 cool
[17:56:34] <razzi>	 cool
[17:56:48] <razzi>	 host is downtimed
[17:56:48] <razzi>	 host is downtimed
[17:57:04] <razzi>	 now for the mariadb restarts
[17:57:05] <razzi>	 now for the mariadb restarts
[17:57:31] <jynus>	 so in theory, a systemctl call would be enough
[17:57:31] <jynus>	 so in theory, a systemctl call would be enough
[17:57:46] <razzi>	 So I see there's mariadb@s{2,3,4} and just plain mariadb
[17:57:46] <razzi>	 So I see there's mariadb@s{2,3,4} and just plain mariadb
[17:57:57] <jynus>	 but on the link I sent you, we mention that, out of precaution, we recommend stoping replication
[17:57:57] <jynus>	 but on the link I sent you, we mention that, out of precaution, we recommend stoping replication
[17:58:09] <razzi>	 ah right, let me look at the replication state
[17:58:09] <razzi>	 ah right, let me look at the replication state
[17:58:11] <jynus>	 to make sure there is no ongoing alter table or other blocking process ongoing
[17:58:11] <jynus>	 to make sure there is no ongoing alter table or other blocking process ongoing
[17:59:15] <jynus>	 this is one of those things that 99% of the cases is not really necesary but can prevent a big headache if something out of the ordinary is ongoing
[17:59:15] <jynus>	 this is one of those things that 99% of the cases is not really necesary but can prevent a big headache if something out of the ordinary is ongoing
[18:00:24] <jynus>	 for socket in /run/mysqld/*; do mysql --socket=$socket -e "STOP SLAVE"; done
[18:00:24] <jynus>	 for socket in /run/mysqld/*; do mysql --socket=$socket -e "STOP SLAVE"; done
[18:01:11] <razzi>	 Cool, thanks for that snippet
[18:01:12] <razzi>	 Cool, thanks for that snippet
[18:02:30] <razzi>	 Also semi off topic but just for my curiosity, what's the difference between for example /var/run/mysqld/mysqld.s2.sock and /run/mysqld/mysqld.s2.sock ?
[18:02:30] <razzi>	 Also semi off topic but just for my curiosity, what's the difference between for example /var/run/mysqld/mysqld.s2.sock and /run/mysqld/mysqld.s2.sock ?
[18:02:42] <razzi>	 is there some sort of linking between /var and / ?
[18:02:42] <razzi>	 is there some sort of linking between /var and / ?
[18:02:47] <jynus>	 an, /var/run is /run
[18:02:47] <jynus>	 an, /var/run is /run
[18:02:57] <jynus>	 but /run is the lsb standard
[18:02:57] <jynus>	 but /run is the lsb standard
[18:03:15] <jynus>	 atm machine, I know :-P
[18:03:15] <jynus>	 atm machine, I know :-P
[18:03:17] <razzi>	 ok, all replication is disabled
[18:03:17] <razzi>	 ok, all replication is disabled
[18:03:27] <razzi>	 Now to restart mariadb
[18:03:27] <razzi>	 Now to restart mariadb
[18:03:39] <jynus>	 yep, that is the standard way
[18:03:39] <jynus>	 yep, that is the standard way
[18:03:46] <jynus>	 we recomment to start with replication disabled
[18:03:46] <jynus>	 we recomment to start with replication disabled
[18:03:56] <jynus>	 after a crash, but not really needed for a normal restart
[18:03:56] <jynus>	 after a crash, but not really needed for a normal restart
[18:04:23] <razzi>	 For that I'll just `sudo systemctl restart mariadb@s2.service`
[18:04:23] <razzi>	 For that I'll just `sudo systemctl restart mariadb@s2.service`
[18:04:26] <jynus>	 there may be an extra step because of a weird status, depending on the mariadb version
[18:04:26] <jynus>	 there may be an extra step because of a weird status, depending on the mariadb version
[18:04:27] <razzi>	 and repeat for the other sections
[18:04:27] <razzi>	 and repeat for the other sections
[18:04:28] <jynus>	 +1
[18:04:28] <jynus>	 +1
[18:04:34] <razzi>	 here goes!
[18:04:34] <razzi>	 here goes!
[18:04:45] <jynus>	 if all goes well, no output :-)
[18:04:45] <jynus>	 if all goes well, no output :-)
[18:04:45] <razzi>	 I'm going to do one at a time and wait a few minutes
[18:04:45] <razzi>	 I'm going to do one at a time and wait a few minutes
[18:05:00] <jynus>	 you can of course tail journalctl
[18:05:00] <jynus>	 you can of course tail journalctl
[18:05:06] <razzi>	 Do y'all have a !log command for this?
[18:05:06] <razzi>	 Do y'all have a !log command for this?
[18:05:26] <jynus>	 not a standard one, I would just say "restarting X server BUG"
[18:05:26] <jynus>	 not a standard one, I would just say "restarting X server BUG"
[18:05:26] <razzi>	 I'm not totally clear on where / who does this
[18:05:26] <razzi>	 I'm not totally clear on where / who does this
[18:05:34] <jynus>	 you can do it :-)
[18:05:34] <jynus>	 you can do it :-)
[18:05:35] <razzi>	 usually I post in our #wikimedia-analytics and #wikimedia-ops
[18:05:36] <razzi>	 usually I post in our #wikimedia-analytics and #wikimedia-ops
[18:05:50] <jynus>	 yeah, logging here is nice, imagine something weird happens
[18:05:50] <jynus>	 yeah, logging here is nice, imagine something weird happens
[18:05:55] <jynus>	 or you forget to downtime or something
[18:05:55] <jynus>	 or you forget to downtime or something
[18:06:01] <jynus>	 here as in -operations
[18:06:01] <jynus>	 here as in -operations
[18:06:03] <razzi>	 err #wikimedia-operations
[18:06:03] <razzi>	 err #wikimedia-operations
[18:06:03] <razzi>	 yeah
[18:06:04] <razzi>	 yeah
[18:06:13] <razzi>	 !log sudo systemctl restart mariadb@s2.service
[18:06:13] <razzi>	 !log sudo systemctl restart mariadb@s2.service
[18:06:14] <stashbot>	 razzi: Not expecting to hear !log here
[18:06:14] <stashbot>	 razzi: Not expecting to hear !log here
[18:06:16] <jynus>	 he hr sorrt
[18:06:17] <jynus>	 he hr sorrt
[18:06:26] <razzi>	 ok stashbot sry :)
[18:06:26] <razzi>	 ok stashbot sry :)
[18:07:00] <jynus>	 so if you add the bug at the end it comments on phab automatically, maybe you didn't know that
[18:07:00] <jynus>	 so if you add the bug at the end it comments on phab automatically, maybe you didn't know that
[18:07:09] <jynus>	 you can always add a coment manually :-)
[18:07:09] <jynus>	 you can always add a coment manually :-)
[18:07:30] <razzi>	 Ah yes right
[18:07:30] <razzi>	 Ah yes right
[18:07:48] <razzi>	 Cool, that one section freed up about 100G
[18:07:48] <razzi>	 Cool, that one section freed up about 100G
[18:07:50] <jynus>	 I am being super-purist and nitpicky here because you ask, in practice, we don't really care
[18:07:50] <jynus>	 I am being super-purist and nitpicky here because you ask, in practice, we don't really care
[18:07:57] <jynus>	 ;-)
[18:07:57] <jynus>	 ;-)
[18:08:02] <razzi>	 https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=4&orgId=1&var-server=dbstore1007&var-datasource=thanos&var-cluster=misc&from=now-1h&to=now
[18:08:02] <razzi>	 https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=4&orgId=1&var-server=dbstore1007&var-datasource=thanos&var-cluster=misc&from=now-1h&to=now
[18:08:23] <razzi>	 yeah I'm def down for feedback on best practices and good habits :)
[18:08:23] <razzi>	 yeah I'm def down for feedback on best practices and good habits :)
[18:08:44] <razzi>	 I just hit my 1 year here but I still feel like a beginner :P
[18:08:44] <razzi>	 I just hit my 1 year here but I still feel like a beginner :P
[18:08:52] <jynus>	 like, if you had done systemctl restart and done no alert, we wouldn't notice it :-)
[18:08:52] <jynus>	 like, if you had done systemctl restart and done no alert, we wouldn't notice it :-)
[18:09:05] <jynus>	 but do as I say, not as I do :-D
[18:09:05] <jynus>	 but do as I say, not as I do :-D
[18:09:10] <razzi>	 haha
[18:09:10] <razzi>	 haha
[18:09:28] <jynus>	 the graph that led me to get worried was: https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=18&orgId=1&var-server=dbstore1007&var-datasource=thanos&var-cluster=misc&from=1630327522248&to=1631516741210
[18:09:28] <jynus>	 the graph that led me to get worried was: https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=18&orgId=1&var-server=dbstore1007&var-datasource=thanos&var-cluster=misc&from=1630327522248&to=1631516741210
[18:09:46] <jynus>	 I happen to look at dbstores because I happen to use the same puppet role for backup generation :-D
[18:09:46] <jynus>	 I happen to look at dbstores because I happen to use the same puppet role for backup generation :-D
[18:09:51] <razzi>	 Could you explain that one a bit to me?
[18:09:51] <razzi>	 Could you explain that one a bit to me?
[18:09:58] <jynus>	 ah, ofc
[18:09:59] <jynus>	 ah, ofc
[18:10:20] <jynus>	 that's the frequency (number of times per second) that data is being sent to disk for swapping
[18:10:20] <jynus>	 that's the frequency (number of times per second) that data is being sent to disk for swapping
[18:10:25] <razzi>	 ok cool
[18:10:25] <razzi>	 ok cool
[18:10:30] <razzi>	 so ideally it's 0
[18:10:30] <razzi>	 so ideally it's 0
[18:10:49] <razzi>	 Looks like restarting s2 had a good effect, I'm going to move on to s3
[18:10:49] <razzi>	 Looks like restarting s2 had a good effect, I'm going to move on to s3
[18:10:54] <jynus>	 for the config we have- very low swappiness and the type of service (mysql) yes
[18:10:54] <jynus>	 for the config we have- very low swappiness and the type of service (mysql) yes
[18:11:00] <razzi>	 good effect and didn't cause any problem
[18:11:00] <razzi>	 good effect and didn't cause any problem
[18:11:18] <razzi>	 well nothing I know of yet :-) 
[18:11:19] <razzi>	 well nothing I know of yet :-) 
[18:11:35] <razzi>	 but honestly I feel like the uptime of this kind of host doesn't have to be some crazy 99.99999999
[18:11:35] <razzi>	 but honestly I feel like the uptime of this kind of host doesn't have to be some crazy 99.99999999
[18:11:38] <jynus>	 yeah, I think also m*nuel deployed lower memory usage per instance
[18:11:38] <jynus>	 yeah, I think also m*nuel deployed lower memory usage per instance
[18:12:33] <jynus>	 so we try to detect potential swapping when there is a lot of db usage
[18:12:33] <jynus>	 so we try to detect potential swapping when there is a lot of db usage
[18:12:37] <jynus>	 *memory usage
[18:12:37] <jynus>	 *memory usage
[18:12:52] <razzi>	 !log razzi@dbstore1007:~$ sudo systemctl restart mariadb@s3.service for T290841
[18:12:52] <razzi>	 !log razzi@dbstore1007:~$ sudo systemctl restart mariadb@s3.service for T290841
[18:12:52] <stashbot>	 razzi: Not expecting to hear !log here
[18:12:52] <stashbot>	 razzi: Not expecting to hear !log here
[18:12:52] <stashbot>	 T290841: dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841
[18:12:52] <stashbot>	 T290841: dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841
[18:12:56] <jynus>	 because in production, it would mean it is very, very slow to respond to queries
[18:12:56] <jynus>	 because in production, it would mean it is very, very slow to respond to queries
[18:13:05] <RhinosF1>	 Emperor: you might want to downtime ms-be2045 if it's known to be broke as it just moaned about disk space
[18:13:05] <RhinosF1>	 Emperor: you might want to downtime ms-be2045 if it's known to be broke as it just moaned about disk space
[18:13:51] <jynus>	 no need to log here, when I meant here, I meant -operations, sorry for the confusion
[18:13:51] <jynus>	 no need to log here, when I meant here, I meant -operations, sorry for the confusion
[18:14:28] <razzi>	 jynus: another Memory: saturation graph question, what's the difference between swpin and swpout? I figure they should be pretty much the same
[18:14:28] <razzi>	 jynus: another Memory: saturation graph question, what's the difference between swpin and swpout? I figure they should be pretty much the same
[18:14:33] <razzi>	 as much swap is written should be freed
[18:14:33] <razzi>	 as much swap is written should be freed
[18:14:50] <jynus>	 one is writing to swap, the other is reading from swap
[18:14:50] <jynus>	 one is writing to swap, the other is reading from swap
[18:14:55] <razzi>	 oh gotcha
[18:14:55] <razzi>	 oh gotcha
[18:14:56] <jynus>	 both are bad signs
[18:14:56] <jynus>	 both are bad signs
[18:15:00] <razzi>	 so it's not swap write / free
[18:15:00] <razzi>	 so it's not swap write / free
[18:15:04] <razzi>	 write / read
[18:15:04] <razzi>	 write / read
[18:15:38] <jynus>	 I don't know what allocstalls is, but I can read the kernel docs for it
[18:15:38] <jynus>	 I don't know what allocstalls is, but I can read the kernel docs for it
[18:15:44] <razzi>	 re: !log, sure sure, just wanted to communicate I ran it, could have posted without the !log :)
[18:15:44] <razzi>	 re: !log, sure sure, just wanted to communicate I ran it, could have posted without the !log :)
[18:16:03] <razzi>	 a great favor would be to point to me where /I/ could look up such a keyword
[18:16:03] <razzi>	 a great favor would be to point to me where /I/ could look up such a keyword
[18:16:13] <jynus>	 I just googled it
[18:16:13] <jynus>	 I just googled it
[18:16:21] <razzi>	 haha that's what I'd have done xD
[18:16:21] <razzi>	 haha that's what I'd have done xD
[18:16:40] <jynus>	 "threads entering direct reclaim"
[18:16:40] <jynus>	 "threads entering direct reclaim"
[18:18:04] <jynus>	 the thing is, that is one way to see it, the other way is, that if we go over a certain threshold, the OOM will kick in and kill the process
[18:18:04] <jynus>	 the thing is, that is one way to see it, the other way is, that if we go over a certain threshold, the OOM will kick in and kill the process
[18:18:19] <razzi>	 yep, definitely don't want that!
[18:18:19] <razzi>	 yep, definitely don't want that!
[18:18:26] <jynus>	 mysql at that point is usually unusable
[18:18:26] <jynus>	 mysql at that point is usually unusable
[18:18:35] <jynus>	 we want to catch it before that happens
[18:18:35] <jynus>	 we want to catch it before that happens
[18:18:45] <razzi>	 Restarting s3 had even more impact, memory usage is down to under 50%
[18:18:45] <razzi>	 Restarting s3 had even more impact, memory usage is down to under 50%
[18:18:49] <jynus>	 as we want memory buffers in memory, not on disk
[18:18:49] <jynus>	 as we want memory buffers in memory, not on disk
[18:19:02] <jynus>	 and if there is high memory pressure, just reduce the memory buffers of mysql
[18:19:03] <jynus>	 and if there is high memory pressure, just reduce the memory buffers of mysql
[18:19:05] <razzi>	 Going to restart s4
[18:19:06] <razzi>	 Going to restart s4
[18:19:24] <jynus>	 however, there seems to be some kind of leak that reserves more and more memory and doesn't release it
[18:19:24] <jynus>	 however, there seems to be some kind of leak that reserves more and more memory and doesn't release it
[18:19:37] <jynus>	 could be a mariadb bug, or some users process 
[18:19:37] <jynus>	 could be a mariadb bug, or some users process 
[18:19:38] <razzi>	 yeah totally
[18:19:39] <razzi>	 yeah totally
[18:19:54] <razzi>	 s4 restart completed fine
[18:19:54] <razzi>	 s4 restart completed fine
[18:19:56] <jynus>	 e.g. some stored procedure, etc.
[18:19:56] <jynus>	 e.g. some stored procedure, etc.
[18:20:21] <jynus>	 maybe after manuel's adjustment it doesn't happen again, who knows?
[18:20:21] <jynus>	 maybe after manuel's adjustment it doesn't happen again, who knows?
[18:20:26] <jynus>	 so check replication is running
[18:20:27] <jynus>	 so check replication is running
[18:20:48] <jynus>	 if not, the same as before but with START SLAVE
[18:20:48] <jynus>	 if not, the same as before but with START SLAVE
[18:21:05] <razzi>	 another question, how to parse `show slave status;` ?
[18:21:06] <razzi>	 another question, how to parse `show slave status;` ?
[18:21:10] <razzi>	 sooo many columns
[18:21:10] <razzi>	 sooo many columns
[18:21:17] <jynus>	 SHOW SLAVE STATUS\G
[18:21:17] <jynus>	 SHOW SLAVE STATUS\G
[18:21:21] <jynus>	 without semicolon
[18:21:21] <jynus>	 without semicolon
[18:21:23] <jynus>	 :-)
[18:21:24] <jynus>	 :-)
[18:21:24] <razzi>	 ohh
[18:21:24] <razzi>	 ohh
[18:21:37] <razzi>	 wow that's so much better
[18:21:37] <razzi>	 wow that's so much better
[18:21:40] <razzi>	 and such a weird syntax
[18:21:41] <razzi>	 and such a weird syntax
[18:21:48] <jynus>	 you can also do something like
[18:21:48] <jynus>	 you can also do something like
[18:21:59] <razzi>	 I saw that on a doc and thought "looks like some control character garbage got copied by mistake"... little did I know...
[18:21:59] <razzi>	 I saw that on a doc and thought "looks like some control character garbage got copied by mistake"... little did I know...
[18:22:24] <jynus>	 pager grep Running
[18:22:24] <jynus>	 pager grep Running
[18:22:25] <razzi>	 ok cool, as expected it's not replicating
[18:22:25] <razzi>	 ok cool, as expected it's not replicating
[18:22:30] <jynus>	 and it will grep the output
[18:22:30] <jynus>	 and it will grep the output
[18:22:32] <razzi>	 will do the reverse of your bash for loop
[18:22:33] <razzi>	 will do the reverse of your bash for loop
[18:22:47] <jynus>	 pager is usually use for something like "pager less"
[18:22:47] <jynus>	 pager is usually use for something like "pager less"
[18:22:56] <jynus>	 but it is nice for grepping and other stuff 
[18:22:57] <jynus>	 but it is nice for grepping and other stuff 
[18:22:57] <razzi>	 pretty embarassingly simple question, is the replication the replication /to/ dbstore1007?
[18:22:57] <razzi>	 pretty embarassingly simple question, is the replication the replication /to/ dbstore1007?
[18:23:04] <jynus>	 yep
[18:23:04] <jynus>	 yep
[18:23:06] <razzi>	 cool
[18:23:06] <razzi>	 cool
[18:23:12] <jynus>	 you setup on the host were you replicate from
[18:23:12] <jynus>	 you setup on the host were you replicate from
[18:23:20] <jynus>	 it is like a pull system
[18:23:20] <jynus>	 it is like a pull system
[18:23:30] <razzi>	 is there a way to ask a host "who's following you"?
[18:23:30] <razzi>	 is there a way to ask a host "who's following you"?
[18:23:36] <jynus>	 yes
[18:23:36] <jynus>	 yes
[18:23:45] <jynus>	 show slave HOSTS; I think
[18:23:45] <jynus>	 show slave HOSTS; I think
[18:24:19] <jynus>	 long time since I was a DBA, so please don't trust me 100%
[18:24:20] <jynus>	 long time since I was a DBA, so please don't trust me 100%
[18:24:22] <razzi>	 for socket in /run/mysqld/*; do sudo mysql --socket=$socket -e "START SLAVE"; done
[18:24:22] <razzi>	 for socket in /run/mysqld/*; do sudo mysql --socket=$socket -e "START SLAVE"; done
[18:24:29] <razzi>	 Should be all set
[18:24:29] <razzi>	 Should be all set
[18:24:35] <jynus>	 yeah, that will work even if it was running already
[18:24:36] <jynus>	 yeah, that will work even if it was running already
[18:24:50] <jynus>	 so normally that would be the end of it
[18:24:50] <jynus>	 so normally that would be the end of it
[18:24:58] <jynus>	 but apparently there is a bug on some packages
[18:24:58] <jynus>	 but apparently there is a bug on some packages
[18:25:11] <jynus>	 that don't restart prometheus-mysqld-exporter
[18:25:11] <jynus>	 that don't restart prometheus-mysqld-exporter
[18:25:32] <jynus>	 let me check for you if graphs are coming to prometheus
[18:25:32] <jynus>	 let me check for you if graphs are coming to prometheus
[18:25:41] <jynus>	 meanwhile, make sure icinga is happy with the host :-)
[18:25:41] <jynus>	 meanwhile, make sure icinga is happy with the host :-)
[18:26:24] <razzi>	 Is prometheus the data source for grafana?
[18:26:24] <razzi>	 Is prometheus the data source for grafana?
[18:26:28] <razzi>	 That's been updating fine
[18:26:28] <razzi>	 That's been updating fine
[18:26:42] <jynus>	 yes
[18:26:43] <jynus>	 yes
[18:26:57] <razzi>	 cool, graph looks good by the way, memory is down to 14%
[18:26:58] <razzi>	 cool, graph looks good by the way, memory is down to 14%
[18:27:08] <jynus>	 so then that would be it, I think some package versions had a bug of not restarting prometheus exporter or ending in a bad state
[18:27:08] <jynus>	 so then that would be it, I think some package versions had a bug of not restarting prometheus exporter or ending in a bad state
[18:27:10] <razzi>	 it's increasing but hopefully will flatten out at some point less than 90%
[18:27:10] <razzi>	 it's increasing but hopefully will flatten out at some point less than 90%
[18:27:13] <jynus>	 if mysql restarted
[18:27:13] <jynus>	 if mysql restarted
[18:27:37] <razzi>	 alright, going to take off the downtime
[18:27:37] <razzi>	 alright, going to take off the downtime
[18:28:03] <jynus>	 all green on icinga, yep
[18:28:03] <jynus>	 all green on icinga, yep
[18:28:47] <jynus>	 one cannot be careful enough- e.g. users complaining because they just started a big query, unnecesary alert noise, instances don't come back up :-)
[18:28:47] <jynus>	 one cannot be careful enough- e.g. users complaining because they just started a big query, unnecesary alert noise, instances don't come back up :-)
[18:29:24] <razzi>	 downtime is removed
[18:29:24] <razzi>	 downtime is removed
[18:29:38] <jynus>	 so that should be all really, I think
[18:29:39] <jynus>	 so that should be all really, I think
[18:29:45] <razzi>	 yep yep, I've definitely set off storms of alerts before...
[18:29:45] <razzi>	 yep yep, I've definitely set off storms of alerts before...
[18:30:11] <razzi>	 yeah I'll keep an eye on the graph, but hopefully it won't get to that scary oom point anytime soon
[18:30:11] <razzi>	 yeah I'll keep an eye on the graph, but hopefully it won't get to that scary oom point anytime soon
[18:30:21] <jynus>	 no need, really
[18:30:21] <jynus>	 no need, really
[18:30:29] <jynus>	 so the alert was preciselly to warn us
[18:30:29] <jynus>	 so the alert was preciselly to warn us
[18:30:32] <razzi>	 I guess it'll alert if it needs attention
[18:30:32] <razzi>	 I guess it'll alert if it needs attention
[18:30:33] <razzi>	 yeah
[18:30:33] <razzi>	 yeah
[18:30:37] <jynus>	 yep
[18:30:37] <jynus>	 yep
[18:30:53] <razzi>	 cool, thanks for all the help and knowledge jynus!
[18:30:53] <razzi>	 cool, thanks for all the help and knowledge jynus!
[18:31:01] <jynus>	 no, thanks to you for working on this!
[18:31:01] <jynus>	 no, thanks to you for working on this!
[18:31:10] <razzi>	 :)
[18:31:10] <razzi>	 :)
[18:31:16] <razzi>	 I'm off to lunch, catch y'all later!
[18:31:16] <razzi>	 I'm off to lunch, catch y'all later!
[18:31:17] <jynus>	 maybe in some time the parent can be closed too
[18:31:17] <jynus>	 maybe in some time the parent can be closed too
[18:31:20] <jynus>	 bye!
[18:31:20] <jynus>	 bye!
[18:31:21] <razzi>	 ah yes
[18:31:21] <razzi>	 ah yes
[18:32:25] <jynus>	 feel free to close T290841 yourself or your manager, or however your team does :-D
[18:32:25] <jynus>	 feel free to close T290841 yourself or your manager, or however your team does :-D
[18:32:25] <stashbot>	 T290841: dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841
[18:32:26] <stashbot>	 T290841: dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841