[09:52:09] https://www.tomsguide.com/news/zoom-security-privacy-woes I am leaving this here; there are still some trainings etc that use zoom, and zoom app security is still an issue, and as someone with escalated privileges on the cluster I still won't install it. [11:19:57] Emperor: plan to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/849595 unless you have objections [11:39:19] jbond: please do (I just +1d your revised version) [11:39:26] ack doing so now cheers [11:39:57] done [11:53:54] apergos: interesting [11:53:59] My work ban zoom [11:56:09] Well heavily discouraged for anything but allowed for public conferences [12:51:02] all the puppetmasters on cloud vps started failing to run puppet, saying something about fetch_swift_rings ca_server having an unexpected value, anyone is working on it? [12:53:34] I think it might be https://gerrit.wikimedia.org/r/c/operations/puppet/+/852767, adding the type Stdlib::Host for that param that made the runs fail, cc. jbond vgutierrez [12:54:12] cc taavi [12:55:43] maybe this should be updated: hieradata/cloud.yaml:puppet_ca_server: ~ [12:55:58] Hmmm cloud.yaml had a value in place for puppet_ca_server [12:56:03] (not a Stdilb::Host) [12:56:56] suspe4ct you just need yto change the type definition to Optional[Stdlib::HOst [12:57:01] suspe4ct you just need yto change the type definition to Optional[Stdlib::Host[ [12:57:08] yep [13:00:12] or we could alias puppet_ca_server to puppetmaster in cloud.yaml, that seems like a sensible default for most projects [13:01:28] also works [13:04:58] https://www.irccloud.com/pastebin/wWXOZrYe/ [13:05:41] it added the following entry to puppet.conf (that I think it's expected): [13:05:42] +ca_server = paws-puppetmaster-2.paws.eqiad1.wikimedia.cloud [13:06:30] yeah it is [13:06:50] ca_server defaults to server [13:06:59] So it's effectively a NOOP [13:20:57] are you submitting a CR dcaro? [13:21:04] oh, I was not xd [13:21:39] I can do it after lunch [13:24:38] https://gerrit.wikimedia.org/r/c/operations/puppet/+/852839 there you go :) [13:30:02] Nice... could you get rid of deployment-puppermaster04.yaml? [13:30:19] As part of that CR I mean [13:33:28] is that related? [13:35:05] same thing basically [13:35:17] done [13:36:10] running PCC against a couple of puppetmasters in cloud [13:36:17] 👍 [13:36:24] sigh.. no facts for paws-puppetmaster-2.paws.eqiad1.wikimedia.cloud [13:36:52] and for deployment-puppetmaster04 is a NOOP: https://puppet-compiler.wmflabs.org/pcc-worker1001/37940/ [13:38:45] please also ensure the cloud-puppetmasters in cloudinfra look good [13:39:09] are those exporting facts to the pcc-compilers? [13:39:18] yes [13:39:54] do you have a FQDN handy? [13:40:01] I don't have visibility of that project [13:41:12] cloud-puppetmaster-03.cloudinfra.eqiad.wmflabs,cloud-puppetmaster-05.cloudinfra.eqiad1.wikimedia.cloud [13:41:23] they have slightly different configs so please do both [13:41:35] thx :) [13:42:12] (running PCC as we speak) [13:43:04] taavi: looking good IMHO: https://puppet-compiler.wmflabs.org/pcc-worker1003/37941/ [13:48:27] lgtm [13:52:00] apergos: new dumps failure of the day: https://phabricator.wikimedia.org/T322328 [13:52:32] no idea at all. uses that fancy kerberos stuff [13:53:54] got the error? because I didn't get any error email [13:53:54] dcaro: thx <3 [13:54:11] 👍 [13:58:02] andrewbogott: ? 
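Editor's note: a minimal way to double-check the ca_server change discussed above on an affected cloud puppetmaster, assuming a standard operations/puppet checkout; the "ca_server defaults to server" behaviour is taken from the conversation, and the commands are an illustrative sketch rather than the procedure that was actually run.

    # In an operations/puppet checkout: which hiera layers set puppet_ca_server?
    # (a bare "~" is YAML null, which the non-Optional Stdlib::Host type rejected)
    grep -rn 'puppet_ca_server:' hieradata/
    # On the puppetmaster itself: confirm the rendered agent settings are a no-op,
    # i.e. ca_server now just restates the default (the value of server).
    sudo puppet config print server ca_server --section agent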
[13:58:24] apergos: the error is showing up on my alert manager dash [13:58:51] and I stopped trying to debug once I saw '/usr/local/bin/systemd-timer-mail-wrapper' [13:58:59] thoughts about who I might ping about that? [13:59:13] do you have the actual error from the systemd unit? at least it could go in the task [14:00:29] Luca Toscano make the kerberos related changes and may know something about that [14:00:39] I added systemd status but it is not helpful [14:00:49] vgutierrez: dcaro: cloud-puppetmaster-03 needs to be able to sign certificates for tenant VMs, would changing ca_server there break it? [14:01:39] so if ca_server isn't currently set [14:01:40] hmmm, as long as it's the same as it was using it should be ok no? [14:01:49] (as in the default value it was using) [14:01:50] setting it to server isn't a change [14:01:53] it's just stating the default value [14:02:44] what's the default value? [14:03:15] ca_server = server [14:03:30] ah [14:03:37] so that should be fine then I think [14:04:58] under the hood this appears to run /usr/local/bin/hdfs-rsync (I am looking at modules/dumps/manifests/web/fetches/analytics/job.pp) [14:05:47] let me double check that actually [14:06:44] yes, so you might try running the whole command manually, with the /usr/local/bin/kerberos-run-command wrapper [14:07:16] usr/local/bin/kerberos-run-command dumpsgen /usr/local/bin/rsync-analytics-pageview_complete_dumps or perhaps the same thing with [14:07:55] /usr/local/bin/hdfs-rsync -r -t --delete --exclude "readme.html" --perms --chmod D755,F644 hdfs:///wmf/data/archive/pageview/complete/ file:///srv/dumps/xmldatadumps/public/other/pageview_complete/ substituted in for the rsync-analytics-pageview_complete_dumps command and see if you get something better, maybe a full stack trace [14:07:57] D755: Add visual error handling to dashboard and detail - https://phabricator.wikimedia.org/D755 [14:08:06] I've never looked at any of this stuff at all, mind [14:09:05] if it failed 8 hours ago, what was working before that and broke then? [14:09:15] andrewbogott: ^^ [14:09:36] I'm in meetings now, sorry -- please tag anyone you an think of on that ticket [14:10:24] ok [15:06:06] who owns cloudmetrics100{1,2}? They're both emailing ever 5 minutes to say ModuleNotFoundError: No module named 'graphite' in /usr/local/sbin/graphite-index [15:12:30] Emperor: usually I'd say just check /etc/wikimedia/contacts.yaml [15:12:48] but in this case that backfires... as they have the spare::system role [15:13:35] 😿 [15:13:39] the naming suggests WMCS, although I thought those hosts were already decomissioned.. cc andrewbogott [15:14:13] yeah I think they were put into spare::system from their original role [15:14:44] https://phabricator.wikimedia.org/T297712 [15:14:49] shall I just go and nobble the cron-jobs, then, if they're spare::system puppet won't put them back? [15:14:58] T297444 [15:14:59] T297444: decommission cloudmetrics100[1-2].eqiad.wmnet - https://phabricator.wikimedia.org/T297444 [15:15:01] yep, kill whatever you need to kill :) [15:15:18] andrewbogott: anything preventing to run the decom cookbook? [15:15:21] andrewbogott: is there anything remaining on those? [15:15:37] that would be the more complete approach, but maybe there's still stuff needed on the nodes [15:15:38] the decom task is 1y old :( [15:16:03] I am trying to attend the monthly meeting so not fully tuned in here. Probably they can be decom'd. [15:16:18] Have they been mailing every 5 minutes for a year? [15:16:31] Not to me... 
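Editor's note: a sketch of the "nobble the cron jobs" step that follows; the cron locations are assumptions about where the graphite-index entry lives on cloudmetrics100[12], so check what is actually installed before deleting anything.

    # Find whichever crontab carries the broken graphite-index job...
    sudo crontab -l
    sudo grep -rn 'graphite-index' /etc/cron.d/ /var/spool/cron/crontabs/ 2>/dev/null
    # ...then comment it out or drop the fragment; with role(spare::system) applied,
    # puppet should not put it back on the next run.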
[15:18:09] [crontabs edited] [15:19:13] thx [15:19:43] [also trying to pay attention to the staff meeting] [17:10:06] folks I am checking conf1008, root partition almost filled [17:10:15] 33G etcd_access.log.1 [17:11:22] 1007 and 1009 seem leading to the same result, from host-overview it seems something gradual over time [17:12:49] as immediate step, I'd say to just truncate the log files to gather say 10G of space [17:13:06] anything against it? [17:13:10] that's weird, all calls to /v2/keys/conftool/v1/mediawiki-config/?recursive=true [17:13:16] yeah [17:13:30] elukey: I just freed another 1% disk with "apt-get clean" [17:13:34] at a speed that i would say are not caching it as it was supposed to [17:13:40] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=conf1009&var-datasource=thanos&var-cluster=etcd&from=now-7d&to=now [17:13:43] volans: --^ [17:13:48] there was a sudden change [17:14:03] yeah I was looking for the same [17:14:08] I bet there was a change in behavuour [17:14:12] elukey: let's just gzip the .log.1 [17:14:29] mutante: that's for later, we need to fix the root cause [17:14:40] MW should not depend on etcd "directly" [17:14:51] Ok, I thought preventing it from running full was before root cause. [17:15:07] PROBLEM - Disk space on conf1008 is CRITICAL alerted a bit ago [17:15:07] there are still 3G left :D [17:15:15] oh I see, reading up now :) [17:15:17] ok then, stepping back [17:15:37] mutante: there is no space for gzipping teh log [17:15:40] volans: let's truncate the logs now to get some breathing room, then we fix the root cause [17:15:51] we can only move it to /srv then gzip it and hte move it back if we want to keep it [17:15:56] not sure it's worth though [17:16:05] yeah if it was that important it'd be in logstash I assume [17:16:06] to keep it I mean [17:16:40] truncate -s 20G? [17:17:10] depends how the log writer works, that could cause problems if it's the live file being written to [17:17:19] we can truncate the .1 only [17:17:25] yeah it is already rotated [17:18:05] I don't see a SAL entry at that time [17:19:51] mmm I see a lot of snapshot10XX nodes in the nginx access log [17:21:03] and that's why I suggested just doing the .1 file like you did as well [17:22:10] tcp6 0 0 snapshot1013.eqia:52598 conf1009.eqiad.wmn:4001 TIME_WAIT - [17:22:13] tcp6 0 0 snapshot1013.eqia:41702 conf1009.eqiad.wmn:4001 TIME_WAIT - [17:22:16] tcp6 0 0 snapshot1013.eqia:39872 conf1007.eqiad.wmn:4001 TIME_WAIT - [17:22:19] a ton of them on 1013 [17:22:21] elukey@snapshot1013:~$ sudo netstat -tuap | grep snapshot | wc -l [17:22:21] 42007 [17:24:30] elukey: I'm analyzing a bit etcd_access.log.2.gz in /srv, I'll let you know what I find [17:24:41] (more clear start time and all client IPs) [17:27:01] do we know why snapshot nodes would fetch /v2/keys/conftool/v1/mediawiki-config/?recursive=true [17:27:04] ? [17:27:30] that's the mediawiki config for connecting to any DB [17:27:47] okok [17:28:04] apergos: o/ around by any chance? [17:28:48] grepping for 01/Nov/2022:08:09 (just 1 minute) returns 4670 requests!!! [17:30:01] do you think it is a script fetching the mw-config and hammering conf100x ? [17:30:26] I think it's the "wikidump" that started on November 1st. [17:30:28] not sure [17:30:33] python3 ./worker.py --configfile /etc/dumps/confs/wikidump.conf.dumps ... 
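Editor's note: the triage steps used in this stretch of the log, collected into one sketch; the first commands run on a conf host, the last on a suspect snapshot host, and the exact /srv path for the rotated log is an assumption based on the "analyzing etcd_access.log.2.gz in /srv" message.

    # What is filling the root partition, and how big are the etcd access logs?
    sudo du -xh --max-depth=2 / 2>/dev/null | sort -h | tail -n 20
    sudo ls -lh /var/log/nginx/etcd_access.log*
    # Requests in a single minute of the rotated log (~4670/min in this case):
    sudo zgrep -c '01/Nov/2022:08:09' /srv/etcd_access.log.2.gz
    # The already-rotated file is safe to shrink; leave the live log alone:
    sudo truncate -s 20G /var/log/nginx/etcd_access.log.1
    # On a suspect snapshot host: connection churn towards etcd (port 4001):
    sudo ss -tan state time-wait '( dport = :4001 )' | wc -l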
[17:30:38] interesting [17:30:38] yeah I see at 8:29 though [17:30:50] not exactly matching, but it is the closest one [17:31:00] if you look inside that /etc/dumps/confs/wikidump.conf.dumps there are all the dblists in the config [17:31:02] do we have it on all snapshot nodes? [17:31:24] and this thing starts at the beginning of the month and is still running [17:31:37] user "dumpsgen" [17:32:27] elukey, mutante: I have to step out for a bit, can I leave this to you for now? [17:32:50] I was about to do the same :D I can stay a bit more yes [17:33:30] is there anybody in the US timezone that could takeover? [17:36:43] I guess the answer is no :D [17:37:32] I think we could leave it to the people oncall once they're donw with the incident [17:37:36] I'll be back in ~40m [17:37:41] sorry gotta go right now [17:37:57] ok so I am truncating the nginx logs, some nodes reached 100% [17:39:04] FYI I ran `sudo truncate -s 20G /var/log/nginx/etcd_access.log.1` on all 100X nodes [17:39:37] so the disk space is now under control [17:39:48] trying to track down how all this links up [17:39:59] we've had some minor impacts on lvs servers from conf1007 being unhealthy [17:40:24] but the disk got full there because of excessive going on with snapshot1013? [17:40:39] vgutierrez: ^ :) [17:40:47] oh :) [17:40:50] it's like IRC channel whackamole [17:40:53] pybal is still struggling [17:41:11] bblack: this is my understanding yes, I am trying to figure out if it is a single snapshot or multiple ones. On snapshot1013 I see a ton of conns in TIME_WAIT [17:41:30] struggling as in getting 502s from conf1007 [17:41:52] yeah cause and effect could be backwards, possibly [17:42:11] vgutierrez: :( https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=conf1009&var-datasource=thanos&var-cluster=etcd&from=now-7d&to=now [17:42:18] all nodes are hammered [17:43:15] I am hearing there was one change put in place regarding the db reload that might affect dumps on snapshot [17:43:18] nothing has changed on our side on the snapshot hosts; when do we see the increase start? [17:43:26] actually pybal is part of that hammer /o\ [17:43:36] because e.g. the db config "reload" (it's not really that) was in place at the start of the run [17:43:38] it's quite aggressive on the reconnection attempts [17:43:45] apergos: hi! See the above graph, I'd say Nov 01 around 8:15 UTC [17:43:55] that's with the start of the run then [17:44:06] only snapshot1013? [17:44:12] because it doesn't do anything special [17:44:16] so far I checked only that one [17:44:26] lemme see the others [17:44:45] marostegui: ping [17:45:04] ^ not sure how it all connects up, but I'm worried the pace of the database maintenance work may be a factor here [17:45:33] can we get a pause there, is that possible? [17:45:34] hey what's up [17:45:50] we're having all kinds of problems centering on confd falling apart [17:45:52] conf1008: 24GB, conf1007,conf1009: 25GB [17:46:05] all conf nodes have large logs. eqiad-only [17:46:06] apergos: I see a ton of conns to conf100x nodes on snapshot10[10,13,12,11] [17:46:07] there's some connection to snapshots hosts, to db reload something something [17:46:11] hmmm I'm no etcd expert but etcd.service crashed on conf1007 due to lack of HD [17:46:17] and I see tons of db maint !log flying by [17:46:18] is it safe to restart it? 
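Editor's note: a hedged sketch of the restart-and-verify step agreed to just below, assuming the v2 etcdctl tooling on the conf hosts (extra TLS/endpoint flags may be needed) and the usual backup-balancer-first convention for pybal restarts; coordinate the pybal part with Traffic.

    # On conf1007: bring etcd back and confirm cluster health.
    sudo systemctl restart etcd.service
    sudo etcdctl cluster-health
    # On an affected lvs host: did pybal re-establish its etcd watch connections?
    sudo journalctl -u pybal --since '15 min ago' | grep -ci etcd
    # If the "PyBal connections to etcd" alerts don't clear, restart pybal
    # (backup balancer first, then the primary).
    sudo systemctl restart pybal.service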
[17:46:27] vgutierrez: yes let's do it [17:46:29] bblack: it does sound like the dbreload change more and more [17:46:33] marostegui: so I'm starting to suspect the db maint pace may be a driver [17:46:35] apergos: the one you mentioned by Amir [17:46:40] it's not "reload" exactly [17:46:43] bblack: snapshot hosts? those are for XML dumps? [17:46:45] it's affecting lots of other stuff indirectly [17:46:54] bblack: let me stop it then just in case [17:46:54] it's dropping a db when it's out of the pool [17:47:02] yes they are for xml dumps indeed [17:47:07] the process that started on Nov 1 at the time is: python3 ./worker.py --configfile /etc/dumps/confs/wikidump.conf.dumps ... [17:47:18] and it reads all the dblists from that config file [17:47:28] Amir1: let's pause yours too [17:47:30] and I dunno if that change by Amir is really at issue here, it's just the only thing I know that was put in play for the new run [17:47:44] there is something else starting at 08:00 ish "moritzm: installing glibc security updates on buster" [17:48:08] saying it because db maintenance updates I think happen 24/7 [17:48:25] could a security update have caused some daemon reload or something? [17:48:29] folks can somebody (possibly in US timezone) become IC? [17:48:47] vgutierrez: I see the cluster healthy now, and etcd working :) [17:48:50] denisse|m: around? as bblack is debugging [17:48:51] yep [17:48:56] lvs1020 is happy now again [17:48:58] we're not causing an outage with this, AFAIK [17:48:59] bblack: my maintenance is now stopped, let me check amir's [17:49:06] [yet] [17:49:14] it is not bad to have people around just in case [17:49:17] just internal toil [17:49:20] elukey: I'll take it, you can leave [17:49:27] Yes. [17:49:31] Ack [17:49:33] let others debug, try to keep track [17:49:53] bblack: well we need some coordination, etcd went down on a node and pybal suffered because of it, we are close to causing one in my opinion :) [17:50:37] bblack: stopped also all Amir's maintenance but one thread (which is currently executing a long alter table) but all the pooling/repooling is now stopped [17:50:42] apergos: is it doable to stop the run? [17:50:55] are the pybal connections to etcd alerts expected to clear on their own? e.g. PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 12 connections established with conf1007.eqiad.wmnet:4001 (min=73) [17:51:01] I know it is a pain but at the moment it seems a quick way to decrease the pressure [17:51:28] elukey: I guess but it will pick back up in a few hours [17:51:37] we'll lose whatever it's in the middle of [17:51:40] herron: I'm not 100% sure. some will, maybe. we might need restarts, the pybal+etcd integration isn't known for being amazing [17:51:54] mutante: <3 [17:51:56] looking at conf1009 to prevent disk space running out [17:51:58] kk that's what I was wondering essentially if a bounce would be necessary for those to recover [17:52:14] so...
the culprit IMHO is "wikimedia/multi-http-client v1.0" [17:52:26] yep [17:52:29] -rw-r----- 1 www-data www-data 25G Nov 3 17:52 etcd_access.log [17:52:29] -rw-r----- 1 www-data www-data 20G Nov 3 17:38 etcd_access.log.1 [17:52:41] On phone e [17:52:46] I truncated etcd_access.log.1 earlier on vgutierrez [17:52:57] on all three nodes [17:53:06] (it was around 35G) [17:53:08] Amir1: No worries, I went to your screens and stopped everything that was doing pools/depools, just leave one of them which is still running the big alter [17:53:16] right, the first alert of all this was the confd1007 disk space alert, that's kinda where it all started and the first thing that was mostly-mitigated in the realtime sense [17:53:21] apergos: yes let's stop it if possible, at least we'll know if it is the run or not causing this [17:53:22] so we got 55G of access log in conf1009 [17:53:30] and snapshot1010 is going at it like a maniac [17:54:01] vgutierrez: not only 1010, there are 4 snapshot nodes with tons of TIME_WAIT sockets (like thousands) [17:54:03] but who is accessing etcd? [17:54:04] I'll check it once I'm back [17:54:07] or what? [17:54:19] snapshot10[10,13,12,11] [17:54:20] jynus: something from snapshot instances using UA wikimedia/multi-http-client 1.0 [17:54:31] oh sh*t, I gotta go now, sorry folks [17:54:42] oh, I thought that was accessing dumps [17:54:49] I'm leaving 1008 alone, as it does different sorts of runs [17:54:50] understood [17:54:54] Hmm. I think the reload is put in the wrong for loop [17:54:58] apergos: ack thanks [17:55:06] it's fetching GET /v2/keys/conftool/v1/mediawiki-config/?recursive=true [17:55:33] so potential bugs on mw /dump process? [17:55:50] Amir1: got a link? [17:56:02] I need to look at the logs to be sure [17:56:12] bblack: one sec [17:56:42] they should all be gone now [17:57:14] it makes sense at least, it links dumps and etcd (mw config) [17:57:15] apergos: I see some recovery [17:57:20] Here's the document: https://docs.google.com/document/d/1gq8yOn_d8PhEyjNzMytBvPR39596x8cl4g_4AfwNiQ8/edit?usp=sharing [17:57:27] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=conf1009&var-datasource=thanos&var-cluster=etcd&from=now-30m&to=now [17:57:30] confirmed on snapshot1013 the process is gone. still seeing a ton of connections from there to all conf hosts, all in TIME_WAIT and all tcp6 [17:57:30] apergos: --^ [17:57:50] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/798678/13/includes/export/WikiExporter.php [17:57:54] ah, and now it stopped [17:57:54] I think the doc will be useful to organize all ideas [17:58:09] even if not a proper outage [17:58:10] the connection attempts to conf hosts stopped on snapshot1013 [17:58:20] yep so it was what apergos killed :) [17:58:22] yeah [17:58:29] I was going to say XD [17:58:44] yes [17:58:58] apergos: so it was " python3 ./worker.py --configfile /etc/dumps/confs/wikidump.conf.dumps", was it? [17:59:04] also matches the time frame [17:59:14] there are maintenance scripts that are run [17:59:17] so it was those I expect [17:59:18] apergos: how bad is it to interrupt that? [17:59:19] does it want to talk to conf to get dblists?
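Editor's note: how the culprit user agent named above can be confirmed from the access log — a sketch; the awk field position assumes a conventional nginx log format with the client IP first, so adjust to the real format.

    # Top client IPs hitting etcd, then how much of the log is the dumps client:
    sudo awk '{print $1}' /var/log/nginx/etcd_access.log.1 | sort | uniq -c | sort -rn | head
    sudo grep -c 'wikimedia/multi-http-client' /var/log/nginx/etcd_access.log.1
    sudo grep -c '/v2/keys/conftool/v1/mediawiki-config/?recursive=true' /var/log/nginx/etcd_access.log.1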
[17:59:51] example: [17:59:55] /usr/bin/php7.4 /srv/mediawiki/multiversion/MWScript.php dumpBackup.php --wiki=hewiki --dbgroupdefault=dump --full --stub --report=1000 --output=file:/mnt/dumpsdata/xmldatadumps/temp/h/hewiki/hewiki-20221101-stub-meta-history5.xml.gz.inprog_tmp --output=file:/mnt/dumpsdata/xmldatadumps/temp/h/hewiki/hewiki-20221101-stub-meta-current5.xml.gz.inprog_tmp --filter=latest --output=file:/mnt/dumpsdata/xmldatadumps/temp/h/hewiki/hewiki-2 [17:59:55] 0221101-stub-articles5.xml.gz.inprog_tmp --filter=latest --filter=notalk --filter=namespace:!NS_USER --skip-header --start=1466624 --skip-footer --end 1476386 [17:59:58] (sorry) [18:00:24] nothing in the python scripts changed since the last run on the 20th [18:00:28] I assume we don't log etcd reads, it has to be execessive writes right? [18:00:38] (to fill the logs) [18:01:13] yeah, most likly (I am guessing) the logic on dump taking was too overzealous when reloading db configuration [18:01:28] context: https://phabricator.wikimedia.org/T298485 [18:01:42] The last read does reload in processing every row [18:01:51] ups [18:02:06] I will make a patch for that once back [18:02:33] https://phabricator.wikimedia.org/T322156 [18:02:51] Amir1: when you told me you wanted config to be reloaded more frequent, I didn't think you needed it so frequently! :-P [18:03:04] um. yeah :-) [18:03:08] bblack: it's GET /v2/keys/conftool/v1/mediawiki-config/?recursive=true [18:03:11] at least that is an easy fix then [18:03:21] apergos: to be fair, with that bug, backups would had taken too long, at least for medium to large wikis [18:03:27] jynus: I didn't write that part 🥲 [18:03:31] s/backups/dumps/ [18:03:45] Amir1: in any case, I was joking! [18:03:58] as a follow up, we should probably monitor something like https://grafana.wikimedia.org/d/tTE9nvdMk/etcd?orgId=1&from=now-7d&to=now&viewPanel=4 [18:04:04] right now they are semi-broken but that's a side issue, Amir is looking at a fix for that too [18:04:29] I am going to help fill in the incident doc, even if impact was mostly internal [18:05:10] do we need to delete more logs? [18:05:26] in 2 or maybe it is 3 hours, the evening attempt to retry will kick in, so it would be good to have the fix backported and around at that point [18:05:31] or I will need to shoot them all again [18:05:45] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/852220/ [18:05:46] jynus: we should be good, logrotation will gzip the .1 and gain even more space [18:05:53] cool [18:05:56] ^ this is the possible fix? [18:06:03] that cool is for elukey [18:06:06] no that can't be it [18:06:13] I got lost int he related tickets [18:06:16] do we worry about the CPU saturation alerts on LVS? [18:06:26] bblack: I think that is for T322156 [18:06:27] T322156: New errors during this month's full dump run: LoadBalancer.php: No server with index '4' - https://phabricator.wikimedia.org/T322156 [18:06:36] yeah [18:06:39] this is another not yet tracked issue [18:06:52] I will open a ticket to track it [18:07:04] ok [18:08:21] there are more things to clean up, in addition to dumps + patch [18:08:27] as db maintenance was stopped [18:08:47] I am going afk folks, have a good rest of the day :) [18:09:57] As IC I declare this incident stopped but not fully resolved. [18:10:17] +1 [18:10:25] Thanks to everyone for their help. 
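Editor's note: a small follow-up check that the flood really stopped once the dumps run was killed — a sketch to run on any conf host, using the log path quoted above.

    # Matching requests per minute on the live log; should be near zero now.
    sudo sh -c "timeout 60 tail -n0 -f /var/log/nginx/etcd_access.log" | grep -c 'mediawiki-config/?recursive=true'
    # And confirm disk headroom after the truncation and the logrotate gzip:
    df -h /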
[18:10:37] denisse|m: sorry, I was bold to change it in the doc witout telling you [18:10:48] my fault [18:11:41] thanks denisse|m :) [18:12:28] I'll be pingable for awhile yet, but I wo't be actively following this channel, as it seems there's nothing I can do directly at the moment [18:12:42] please do ping if that changes or anyone needs more info about the dumps side of things [18:13:03] I am creating the ticket, apergos [18:13:15] go ahead and add me if you don't mind [18:13:20] conf hosts are technically owned by service ops or traffic? [18:13:28] just to add relevant tags [18:13:36] I will add core team too [18:13:56] service ops I think [18:14:02] apergos: what is your team tag? [18:14:03] I barely know how they work anyways :) [18:14:11] yeah, it is more of a technicality [18:15:14] uh... this doesn't go on their team inbox, jus tput dumps-generation on it [18:15:16] jynus: nowadays you can grep it in the puppet repo for many roles. code says serviceops, yea [18:15:31] apergos: doing, sorry [18:15:34] that way Hannah will see it. and I'll subscribe in any case [18:17:05] https://phabricator.wikimedia.org/T322360 [18:17:14] It doesn't seem to me this incident had any secret stuff and it was all handled in the public channel so.. you might as well decide to make it the "quick incident report" instead of the full one and do it on wikitech right away [18:17:30] might save some work vs first doing a nice doc and then wikitech as well [18:17:55] the state of our icinga alerts is pretty bad lately in general, there's always a ton of outstanding alerts [18:17:56] not 1005, 1013 I think [18:18:00] oh [18:18:01] (re: task description) [18:18:02] makes it hard to see where the real problems are :P [18:18:11] fixing and adding more info [18:18:27] I am mostly worried to track all pending stuff (a patch, stopped maintenance, etc) [18:18:38] yea, imho that's because icinga does not email people, and a couple people who kept looking at web UI have stopped doing so [18:20:27] constantly having unhandled CRITs has always been an issue but it tends to, including those that have 'disabled notifications' but are still in 'unhandled' and not downtimed [18:22:13] * volans back, catching up with backlog [18:22:44] so we have one "CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff" that is pending [18:23:06] but I am sure that is known at DBA [18:23:21] maintenance for dbs was stopped, probably related [18:23:34] the other 28 active alerts are not related to this incident as far as I can see [18:24:03] but it is important because not sure how maintenance will behave if it restarts with that uncommited [18:24:34] jynus: yea, it fits exactly when that was done. 35 to 45 m ago [18:25:16] let me see, I will be a bit safer if it is discarded [18:25:48] if someone can do dbctl config diff [18:25:59] I am on it, marostegui don't worry"! [18:26:02] and paste it here I can check it (I'm on my phone now) [18:26:24] jynus: :) [18:26:32] it was the increase of db1144 's load [18:26:41] "db1144:3315": 75, "db1144:3315": 100, [18:26:44] I am guessing it finished in the middle of a repool [18:26:49] "db1144:3315": 150, "db1144:3315": 200, [18:26:51] and then it is safe to commit yeah [18:26:56] doing [18:27:04] thanks! 
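Editor's note: the pending-dbctl check being asked for here, spelled out; these are the same diff/commit subcommands that appear in the !log entry just below, run from a cumin host.

    # Show the uncommitted configuration left behind by the stopped maintenance...
    sudo dbctl config diff
    # ...and, once the diff is confirmed intentional (the db1144:3315 load bump), commit it:
    sudo dbctl config commit -m 'increase db1144:3315 load'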
[18:27:51] apergos: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/852990 [18:28:17] !log jynus@cumin1001 dbctl commit (dc=all): 'increase db1144:3315 load', diff saved to https://phabricator.wikimedia.org/P38086 and previous config saved to /var/cache/conftool/dbconfig/20221103-182750-jynus.json [18:28:27] ^ marostegui [18:28:32] one less thing to clean up [18:28:38] I just got home, made the patch, I go afk for dinner but keep my phone with me [18:28:52] when do we reload? [18:29:06] thanks. rescheduled icinga check. <+icinga-wm> RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK [18:29:45] apergos: after db queries [18:29:56] seems ok to me [18:30:08] should I +1? who will merge/backport it? [18:30:12] thanks jynus [18:30:28] I will backport it, you can hit +2, in mw you should do that instead of +1 ;) [18:30:46] I have not tested nor have the confidence to +2 it though [18:30:56] is there anything I can help with? [18:31:13] funnily enough after processing the rows, it does reload the config (check for calls of outputPageStreamBatch) [18:31:22] checked disk space on conf* one more time and it's fine now. 83% max [18:31:23] jynus: I linked the MW patch to your ticket to keep the breadcrumbs going [18:31:42] so log has not been growing again since it was truncated and then the dumps were stopped [18:31:43] +1, that was the point of the ticket! [18:31:49] need to go, will return later [18:32:08] volans: I think it's just about documenting what happened. Denisse made a doc for it [18:32:14] volans: yeah, +2 the thing https://gerrit.wikimedia.org/r/c/mediawiki/core/+/798678/13/includes/export/WikiExporter.php if you feel confident about it :-P [18:32:16] it started today at 08:09, see graphs BTW [18:32:28] apergos: I have zero context on that one :D [18:32:37] apergos: then can you find someone who can? I can't self merge [18:32:39] jynus: No need to say sorry, thanks for helping me with the doc! :D [18:32:53] Daniel K. wrote it, please ping him [18:32:56] It's my first time on call so any help is greatly appreciated!! <3 [18:33:32] Amir1: which patch do you need reviewed? [18:33:42] volans: there is https://phabricator.wikimedia.org/T322360 meanwhile [18:33:43] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/798678/13/includes/export/WikiExporter.php [18:33:58] taavi: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/852990 [18:33:59] I'd +1 but not +2, that's the extent of my comfort [18:34:22] ah sorry I am linking the wrong thing in the meantime [18:35:17] duesen: ping? [18:35:44] bblack: in a meeting... what's up? 
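Editor's note: once the fix above is backported and deployed, a quick spot-check that all snapshot hosts carry the same (patched) WikiExporter.php before the evening retry kicks in — an illustrative sketch; the php-* glob is an assumption about the MediaWiki layout on those hosts.

    for h in snapshot1010 snapshot1011 snapshot1012 snapshot1013; do
      ssh "$h.eqiad.wmnet" 'md5sum /srv/mediawiki/php-*/includes/export/WikiExporter.php'
    done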
[18:35:54] duesen: yo, shit's on fire [18:35:57] Amir1: +2'd [18:36:07] duesen: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/852990 [18:36:10] taavi: Thanks [18:38:12] I see where the confusion came from on adding the reload, the variable name is $lastRow but that is actually "previousRow" [18:40:17] hm while we're in here I wonder if we can get a review of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/852220/ from anyone (this is needed because stubs dumps for a number of large wikis are broken without it, but it is tangential to the immediate fire) [18:40:49] apergos: actually, it might also fix that issue, can't say for sure but will see [18:40:55] (the patch) [18:42:12] we'll see, as you say [19:02:07] thanks for the backport and deploy [19:06:27] we are out of the woods, I'll go eat dinner [19:06:41] please do and thanks again [19:07:02] You're welcome 😊 [19:20:07] * duesen is wondering if he broke anything [21:00:50] Are the wcqs pybal alerts expected? [21:01:13] inflatador: ^ [21:02:16] cc: ryankemper [21:02:47] RhinosF1: thanks, looking now [21:03:03] RhinosF1: Could it be related to this issue? https://phabricator.wikimedia.org/T322360 [21:03:05] ryankemper: np [21:03:06] this usually happens when an LVS service is added or removed [21:03:22] denisse|m: that's why I asked, it's just gone off though [21:03:24] then it might need coordination with -traffic and pybal restarts [21:03:31] And that's to my understanding mitigated [21:03:47] 20:57:36 PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.67:443]) https://wikitech.wikimedia.org/wiki/PyBal [21:04:03] .67 should be wcqs eqiad [21:04:14] it sounds more like it's due to some hosts decoms or new installs [21:04:34] https://github.com/wikimedia/puppet/blob/production/hieradata/common/service.yaml#L2567 [21:05:10] yeah blazegraph looks happy on `wcqs1001` for example, but all of `wcqs100[1-3]` are depooled in pybal [21:05:25] We're not doing any changes to our lvs config, seeing if anything recent in #operations might explain it [21:05:25] there are docs for this specific error: https://wikitech.wikimedia.org/wiki/PyBal#Services_known_to_PyBal_but_not_to_IPVS [21:05:40] nice, thanks! [21:05:50] starting with "This alert is usually temporary " [21:06:07] I'm also on wcqs1001 still, will try to re-pool [21:06:27] either it is indeed temporary or if it's not you might need pybal restart [21:06:52] Ah, I think I see what happened here. This likely doesn't need a pybal restart [21:06:56] in that case I would maybe ask -traffic [21:07:02] ok, cool [21:07:05] inflatador: we did the data xfer from wcqs1003->wcqs1001, correct? [21:07:07] ^^ yeah, this is related to our data xfer cookbook [21:07:18] indeed, and it looks like 1002 never got repooled either [21:07:24] just pooled 1001 and 1002 [21:07:36] we should check codfw, they should all be pooled AFAIK [21:07:48] denisse|m: yea, unrelated for sure [21:07:48] (update: they are) [21:08:53] those alerts can take some time to realize a change [21:09:23] it was our fault, all the hosts in the DC got depooled, there should have been at least one left [21:09:28] mutante: it recovered [21:09:38] We're back to normal now [21:09:40] RhinosF1: ah, cool, other channel of course :) [21:09:41] Thanks to RhinosF1 for checking IP 10.2.2.67 and getting back to us [21:09:52] inflatador: no problem [21:10:00] Glad we caught it [21:10:06] Would it have impacted anyone? 
[21:10:28] May have slowed ppl down as only CODFW was pooled [21:10:41] but we have no SLA for WCQS yet, it's still in beta [21:10:53] Any WCQS requests routed to eqiad would have failed during the window (I assume like ~10ish minutes) [21:11:29] inflatador: btw normally pybal refuses to depool a host if would bring it below the configured desired threshold. However 1001/1003 were depooled from us running the `sre.wdqs.data-transfer` cookbook, so perhaps the way it depools gets around that limitation? [21:11:33] Is it mini report worthy then to see if anything can be done to avoid future accidental depooling of all backends in a DC? [21:11:54] would be nice if those errors displayed the VIP DNS name [21:12:00] instead of IP [21:12:40] RhinosF1 I think it falls under this existing ticket https://phabricator.wikimedia.org/T321605 , but feel free to open another if you prefer [21:13:00] ryankemper: I think you are onto something there. there is definitely some lower limit and it would alert in different ways when it gets under that [21:13:23] but also if you only have 1 or 2 servers per DC it's kind of hard to not hit that [21:13:28] the behavior is specific to the query service playbooks, which we're already working on cleaning up [21:14:02] yea, good idea, inflatador, the alert should do the DNS lookup [21:14:17] inflatador: FWIW RhinosF1 was asking about a lightweight incident report, not a phab ticket [21:14:50] ryankemper: yes [21:15:05] Ah yeah, I would say no b/c there's no SLA/SLO but I defer to RhinosF1 or anyone else w/more experience on how to handle incidents [21:15:51] inflatador: id argue there's a follow up though to prevent future repeats [21:16:10] And it wasn't actually responded to because of SRE catching the alerts there [21:16:15] It was a nosey annoying user [21:16:22] I guess it becomes an incident if users actually noticed this but not otherwise. [21:16:27] RhinosF1: :P [21:17:10] I think I'd be in favor of just creating a phab ticket for the DNS lookup (if it's not feasible the ticket could be closed out), and adding some investigation into the depool logic of our `sre.wdqs.data-transfer` cookbook as an acceptance criteria of https://phabricator.wikimedia.org/T321605 [21:17:26] I could go either way though [21:17:27] In that case, I'd say that using the DNS name for the VIP instead of IP should be part of any follow-up [21:17:49] I'll make the DNS name request, guessing Observability would own that? [21:18:03] ryankemper: tickets suit me [21:18:13] As long as it's been properly learned from [21:18:23] inflatador: probably add #traffic too [21:18:46] Probably should be one about warning if too much is depooled too when the action being carried out [21:18:56] RhinosF1: cool, we'll go that route in the interest of minimizing overhead [21:19:17] ryankemper: can you add me onto any created or link them here [21:19:37] RhinosF1: Will do [21:19:52] inflatador: if you want to open the dns lookup ticket I'll add the blurb to the data transfer cookbook ticket [21:23:57] ACK [21:28:14] RhinosF1: Added you as a subscriber to https://phabricator.wikimedia.org/T321605. The bit about investigating the depooling is the final bullet of the acceptance criteria in the ticket desc [21:30:35] Thanks [21:32:04] And here's the ticket for DNS name instead of IP: https://phabricator.wikimedia.org/T322377
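Editor's note: the two follow-ups from this thread, sketched — mapping the bare VIP in the alert back to a service name (the ask in T322377) and re-checking pool state for the wcqs backends; the conftool selector and the expected svc hostname are assumptions to verify.

    # Reverse-resolve the VIP the alert printed:
    dig +short -x 10.2.2.67        # expected: something like wcqs.svc.eqiad.wmnet.
    # Pool state for the wcqs backends in both DCs (from a cumin/conftool host):
    sudo confctl select 'cluster=wcqs' get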