[09:52:09] https://www.tomsguide.com/news/zoom-security-privacy-woes I am leaving this here; there are still some trainings etc that use zoom, and zoom app security is still an issue, and as someone with escalated privileges on the cluster I still won't install it. [11:19:57] Emperor: plan to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/849595 unless you have objections [11:39:19] jbond: please do (I just +1d your revised version) [11:39:26] ack doing so now cheers [11:39:57] done [11:53:54] apergos: interesting [11:53:59] My work ban zoom [11:56:09] Well heavily discouraged for anything but allowed for public conferences [12:51:02] all the puppetmasters on cloud vps started failing to run puppet, saying something about fetch_swift_rings ca_server having an unexpected value, anyone is working on it? [12:53:34] I think it might be https://gerrit.wikimedia.org/r/c/operations/puppet/+/852767, adding the type Stdlib::Host for that param that made the runs fail, cc. jbond vgutierrez [12:54:12] cc taavi [12:55:43] maybe this should be updated: hieradata/cloud.yaml:puppet_ca_server: ~ [12:55:58] Hmmm cloud.yaml had a value in place for puppet_ca_server [12:56:03] (not a Stdilb::Host) [12:56:56] suspe4ct you just need yto change the type definition to Optional[Stdlib::HOst [12:57:01] suspe4ct you just need yto change the type definition to Optional[Stdlib::Host[ [12:57:08] yep [13:00:12] or we could alias puppet_ca_server to puppetmaster in cloud.yaml, that seems like a sensible default for most projects [13:01:28] also works [13:04:58] https://www.irccloud.com/pastebin/wWXOZrYe/ [13:05:41] it added the following entry to puppet.conf (that I think it's expected): [13:05:42] +ca_server = paws-puppetmaster-2.paws.eqiad1.wikimedia.cloud [13:06:30] yeah it is [13:06:50] ca_server defaults to server [13:06:59] So it's effectively a NOOP [13:20:57] are you submitting a CR dcaro? [13:21:04] oh, I was not xd [13:21:39] I can do it after lunch [13:24:38] https://gerrit.wikimedia.org/r/c/operations/puppet/+/852839 there you go :) [13:30:02] Nice... could you get rid of deployment-puppermaster04.yaml? [13:30:19] As part of that CR I mean [13:33:28] is that related? [13:35:05] same thing basically [13:35:17] done [13:36:10] running PCC against a couple of puppetmasters in cloud [13:36:17] 👍 [13:36:24] sigh.. no facts for paws-puppetmaster-2.paws.eqiad1.wikimedia.cloud [13:36:52] and for deployment-puppetmaster04 is a NOOP: https://puppet-compiler.wmflabs.org/pcc-worker1001/37940/ [13:38:45] please also ensure the cloud-puppetmasters in cloudinfra look good [13:39:09] are those exporting facts to the pcc-compilers? [13:39:18] yes [13:39:54] do you have a FQDN handy? [13:40:01] I don't have visibility of that project [13:41:12] cloud-puppetmaster-03.cloudinfra.eqiad.wmflabs,cloud-puppetmaster-05.cloudinfra.eqiad1.wikimedia.cloud [13:41:23] they have slightly different configs so please do both [13:41:35] thx :) [13:42:12] (running PCC as we speak) [13:43:04] taavi: looking good IMHO: https://puppet-compiler.wmflabs.org/pcc-worker1003/37941/ [13:48:27] lgtm [13:52:00] apergos: new dumps failure of the day: https://phabricator.wikimedia.org/T322328 [13:52:32] no idea at all. uses that fancy kerberos stuff [13:53:54] got the error? because I didn't get any error email [13:53:54] dcaro: thx <3 [13:54:11] 👍 [13:58:02] andrewbogott: ? 
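Editor's note: a minimal way to double-check the ca_server change discussed above on an affected cloud puppetmaster, assuming a standard operations/puppet checkout; the "ca_server defaults to server" behaviour is taken from the conversation, and the commands are an illustrative sketch rather than the procedure that was actually run.

    # In an operations/puppet checkout: which hiera layers set puppet_ca_server?
    # (a bare "~" is YAML null, which the non-Optional Stdlib::Host type rejected)
    grep -rn 'puppet_ca_server:' hieradata/
    # On the puppetmaster itself: confirm the rendered agent settings are a no-op,
    # i.e. ca_server now just restates the default (the value of server).
    sudo puppet config print server ca_server --section agent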
[13:58:24] apergos: the error is showing up on my alert manager dash [13:58:51] and I stopped trying to debug once I saw '/usr/local/bin/systemd-timer-mail-wrapper' [13:58:59] thoughts about who I might ping about that? [13:59:13] do you have the actual error from the systemd unit? at least it could go in the task [14:00:29] Luca Toscano make the kerberos related changes and may know something about that [14:00:39] I added systemd status but it is not helpful [14:00:49] vgutierrez: dcaro: cloud-puppetmaster-03 needs to be able to sign certificates for tenant VMs, would changing ca_server there break it? [14:01:39] so if ca_server isn't currently set [14:01:40] hmmm, as long as it's the same as it was using it should be ok no? [14:01:49] (as in the default value it was using) [14:01:50] setting it to server isn't a change [14:01:53] it's just stating the default value [14:02:44] what's the default value? [14:03:15] ca_server = server [14:03:30] ah [14:03:37] so that should be fine then I think [14:04:58] under the hood this appears to run /usr/local/bin/hdfs-rsync (I am looking at modules/dumps/manifests/web/fetches/analytics/job.pp) [14:05:47] let me double check that actually [14:06:44] yes, so you might try running the whole command manually, with the /usr/local/bin/kerberos-run-command wrapper [14:07:16] usr/local/bin/kerberos-run-command dumpsgen /usr/local/bin/rsync-analytics-pageview_complete_dumps or perhaps the same thing with [14:07:55] /usr/local/bin/hdfs-rsync -r -t --delete --exclude "readme.html" --perms --chmod D755,F644 hdfs:///wmf/data/archive/pageview/complete/ file:///srv/dumps/xmldatadumps/public/other/pageview_complete/ substituted in for the rsync-analytics-pageview_complete_dumps command and see if you get something better, maybe a full stack trace [14:07:57] D755: Add visual error handling to dashboard and detail - https://phabricator.wikimedia.org/D755 [14:08:06] I've never looked at any of this stuff at all, mind [14:09:05] if it failed 8 hours ago, what was working before that and broke then? [14:09:15] andrewbogott: ^^ [14:09:36] I'm in meetings now, sorry -- please tag anyone you an think of on that ticket [14:10:24] ok [15:06:06] who owns cloudmetrics100{1,2}? They're both emailing ever 5 minutes to say ModuleNotFoundError: No module named 'graphite' in /usr/local/sbin/graphite-index [15:12:30] Emperor: usually I'd say just check /etc/wikimedia/contacts.yaml [15:12:48] but in this case that backfires... as they have the spare::system role [15:13:35] 😿 [15:13:39] the naming suggests WMCS, although I thought those hosts were already decomissioned.. cc andrewbogott [15:14:13] yeah I think they were put into spare::system from their original role [15:14:44] https://phabricator.wikimedia.org/T297712 [15:14:49] shall I just go and nobble the cron-jobs, then, if they're spare::system puppet won't put them back? [15:14:58] T297444 [15:14:59] T297444: decommission cloudmetrics100[1-2].eqiad.wmnet - https://phabricator.wikimedia.org/T297444 [15:15:01] yep, kill whatever you need to kill :) [15:15:18] andrewbogott: anything preventing to run the decom cookbook? [15:15:21] andrewbogott: is there anything remaining on those? [15:15:37] that would be the more complete approach, but maybe there's still stuff needed on the nodes [15:15:38] the decom task is 1y old :( [15:16:03] I am trying to attend the monthly meeting so not fully tuned in here. Probably they can be decom'd. [15:16:18] Have they been mailing every 5 minutes for a year? [15:16:31] Not to me... 
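Editor's note: a sketch of the "nobble the cron jobs" step that follows; the cron locations are assumptions about where the graphite-index entry lives on cloudmetrics100[12], so check what is actually installed before deleting anything.

    # Find whichever crontab carries the broken graphite-index job...
    sudo crontab -l
    sudo grep -rn 'graphite-index' /etc/cron.d/ /var/spool/cron/crontabs/ 2>/dev/null
    # ...then comment it out or drop the fragment; with role(spare::system) applied,
    # puppet should not put it back on the next run.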
[15:18:09] [crontabs edited] [15:19:13] thx [15:19:43] [also trying to pay attention to the staff meeting] [17:10:06] folks I am checking conf1008, root partition almost filled [17:10:15] 33G etcd_access.log.1 [17:11:22] 1007 and 1009 seem leading to the same result, from host-overview it seems something gradual over time [17:12:49] as immediate step, I'd say to just truncate the log files to gather say 10G of space [17:13:06] anything against it? [17:13:10] that's weird, all calls to /v2/keys/conftool/v1/mediawiki-config/?recursive=true [17:13:16] yeah [17:13:30] elukey: I just freed another 1% disk with "apt-get clean" [17:13:34] at a speed that i would say are not caching it as it was supposed to [17:13:40] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=conf1009&var-datasource=thanos&var-cluster=etcd&from=now-7d&to=now [17:13:43] volans: --^ [17:13:48] there was a sudden change [17:14:03] yeah I was looking for the same [17:14:08] I bet there was a change in behavuour [17:14:12] elukey: let's just gzip the .log.1 [17:14:29] mutante: that's for later, we need to fix the root cause [17:14:40] MW should not depend on etcd "directly" [17:14:51] Ok, I thought preventing it from running full was before root cause. [17:15:07] PROBLEM - Disk space on conf1008 is CRITICAL alerted a bit ago [17:15:07] there are still 3G left :D [17:15:15] oh I see, reading up now :) [17:15:17] ok then, stepping back [17:15:37] mutante: there is no space for gzipping teh log [17:15:40] volans: let's truncate the logs now to get some breathing room, then we fix the root cause [17:15:51] we can only move it to /srv then gzip it and hte move it back if we want to keep it [17:15:56] not sure it's worth though [17:16:05] yeah if it was that important it'd be in logstash I assume [17:16:06] to keep it I mean [17:16:40] truncate -s 20G? [17:17:10] depends how the log writer works, that could cause problems if it's the live file being written to [17:17:19] we can truncate the .1 only [17:17:25] yeah it is already rotated [17:18:05] I don't see a SAL entry at that time [17:19:51] mmm I see a lot of snapshot10XX nodes in the nginx access log [17:21:03] and that's why I suggested just doing the .1 file like you did as well [17:22:10] tcp6 0 0 snapshot1013.eqia:52598 conf1009.eqiad.wmn:4001 TIME_WAIT - [17:22:13] tcp6 0 0 snapshot1013.eqia:41702 conf1009.eqiad.wmn:4001 TIME_WAIT - [17:22:16] tcp6 0 0 snapshot1013.eqia:39872 conf1007.eqiad.wmn:4001 TIME_WAIT - [17:22:19] a ton of them on 1013 [17:22:21] elukey@snapshot1013:~$ sudo netstat -tuap | grep snapshot | wc -l [17:22:21] 42007 [17:24:30] elukey: I'm analyzing a bit etcd_access.log.2.gz in /srv, I'll let you know what I find [17:24:41] (more clear start time and all client IPs) [17:27:01] do we know why snapshot nodes would fetch /v2/keys/conftool/v1/mediawiki-config/?recursive=true [17:27:04] ? [17:27:30] that's the mediawiki config for connecting to any DB [17:27:47] okok [17:28:04] apergos: o/ around by any chance? [17:28:48] grepping for 01/Nov/2022:08:09 (just 1 minute) returns 4670 requests!!! [17:30:01] do you think it is a script fetching the mw-config and hammering conf100x ? [17:30:26] I think it's the "wikidump" that started on November 1st. [17:30:28] not sure [17:30:33] python3 ./worker.py --configfile /etc/dumps/confs/wikidump.conf.dumps ... 
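Editor's note: the triage steps used in this stretch of the log, collected into one sketch; the first commands run on a conf host, the last on a suspect snapshot host, and the exact /srv path for the rotated log is an assumption based on the "analyzing etcd_access.log.2.gz in /srv" message.

    # What is filling the root partition, and how big are the etcd access logs?
    sudo du -xh --max-depth=2 / 2>/dev/null | sort -h | tail -n 20
    sudo ls -lh /var/log/nginx/etcd_access.log*
    # Requests in a single minute of the rotated log (~4670/min in this case):
    sudo zgrep -c '01/Nov/2022:08:09' /srv/etcd_access.log.2.gz
    # The already-rotated file is safe to shrink; leave the live log alone:
    sudo truncate -s 20G /var/log/nginx/etcd_access.log.1
    # On a suspect snapshot host: connection churn towards etcd (port 4001):
    sudo ss -tan state time-wait '( dport = :4001 )' | wc -l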
[17:30:38] interesting [17:30:38] yeah I see at 8:29 though [17:30:50] not exactly matching, but it is the closest one [17:31:00] if you look inside that /etc/dumps/confs/wikidump.conf.dumps there are all the dblists in the config [17:31:02] do we have it on all snapshot nodes? [17:31:24] and this thing starts at the beginning of the month and is still running [17:31:37] user "dumpsgen" [17:32:27] elukey, mutante: I have to step out for a bit, can I leave this to you for now? [17:32:50] I was about to do the same :D I can stay a bit more yes [17:33:30] is there anybody in the US timezone that could takeover? [17:36:43] I guess the answer is no :D [17:37:32] I think we could leave it to the people oncall once they're donw with the incident [17:37:36] I'll be back in ~40m [17:37:41] sorry gotta go right now [17:37:57] ok so I am truncating the nginx logs, some nodes reached 100% [17:39:04] FYI I ran `sudo truncate -s 20G /var/log/nginx/etcd_access.log.1` on all 100X nodes [17:39:37] so the disk space is now under control [17:39:48] trying to track down how all this links up [17:39:59] we've had some minor impacts on lvs servers from conf1007 being unhealthy [17:40:24] but the disk got full there because of excessive going on with snapshot1013? [17:40:39] vgutierrez: ^ :) [17:40:47] oh :) [17:40:50] it's like IRC channel whackamole [17:40:53] pybal is still struggling [17:41:11] bblack: this is my understanding yes, I am trying to figure out if it is a single snapshot or multiple ones. On snapshot1013 I see a ton of conns in TIME_WAIT [17:41:30] struggling as in getting 502s from conf1007 [17:41:52] yeah cause and effect could be backwards, possibly [17:42:11] vgutierrez: :( https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=conf1009&var-datasource=thanos&var-cluster=etcd&from=now-7d&to=now [17:42:18] all nodes are hammered [17:43:15] I am hearing there was one change put in place regarding the db reload that might affect dumps on snapshot [17:43:18] nothing has changed on our side on the snapshot hosts; when do we see the increase start? [17:43:26] actually pybal is part of that hammer /o\ [17:43:36] because e.g. the db config "reload" (it's not really that) was in place at the start of the run [17:43:38] it's quite aggressive on the reconnection attempts [17:43:45] apergos: hi! See the above graph, I'd say Nov 01 around 8:15 UTC [17:43:55] that's with the start of the run then [17:44:06] only snapshot1013? [17:44:12] because it doesn't do anything special [17:44:16] so far I checked only that one [17:44:26] lemme see the others [17:44:45] marostegui: ping [17:45:04] ^ not sure how it all connects up, but I'm worried the pace of the database maintenance work may be a factor here [17:45:33] can we get a pause there, is that possible? [17:45:34] hey what's up [17:45:50] we're having all kinds of problems centering on confd falling apart [17:45:52] conf1008: 24GB, conf1007,conf1009: 25GB [17:46:05] all conf nodes have large logs. eqiad-only [17:46:06] apergos: I see a ton of conns to conf100x nodes on snapshot10[10,13,12,11] [17:46:07] there's some connection to snapshots hosts, to db reload something something [17:46:11] hmmm I'm no etcd expert but etcd.service crashed on conf1007 due to lack of HD [17:46:17] and I see tons of db maint !log flying by [17:46:18] is it safe to restart it? 
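Editor's note: a hedged sketch of the restart-and-verify step agreed to just below, assuming the v2 etcdctl tooling on the conf hosts (extra TLS/endpoint flags may be needed) and the usual backup-balancer-first convention for pybal restarts; coordinate the pybal part with Traffic.

    # On conf1007: bring etcd back and confirm cluster health.
    sudo systemctl restart etcd.service
    sudo etcdctl cluster-health
    # On an affected lvs host: did pybal re-establish its etcd watch connections?
    sudo journalctl -u pybal --since '15 min ago' | grep -ci etcd
    # If the "PyBal connections to etcd" alerts don't clear, restart pybal
    # (backup balancer first, then the primary).
    sudo systemctl restart pybal.service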
[17:46:27] vgutierrez: yes let's do it [17:46:29] bblack: it does sound like the dbreload change more and more [17:46:33] marostegui: so I'm starting to suspect the db maint pace may be a driver [17:46:35] apergos: the one you mentioned by Amir [17:46:40] it's not "reload" exactly [17:46:43] bblack: snapshot hosts? those are for XML dumps? [17:46:45] it's affecting lots of other stuff indirectly [17:46:54] bblack: let me stop it then just in case [17:46:54] it's dropping a db when it's out of the pool [17:47:02] yes they are for xml dumps indeed [17:47:07] the process that started on Nov 1 at the time is: python3 ./worker.py --configfile /etc/dumps/confs/wikidump.conf.dumps ... [17:47:18] and it reads all the dblists from that config file [17:47:28] Amir1: let's pause yours too [17:47:30] and I dunno if that change by Amir is really at issue here, it's just the only thing I know that was put in play for the new run [17:47:44] there is something else starting at 08:00 ish "moritzm: installing glibc security updates on buster" [17:48:08] saying it because db maintenance updates I think happen 24/7 [17:48:25] could a security update have caused some daemon reload or something? [17:48:29] folks can somebody (possibly in US timezone) become IC? [17:48:47] vgutierrez: I see the cluster healthy now, and etcd working :) [17:48:50] denisse|m: around? as bblack is debugging [17:48:51] yep [17:48:56] lvs1020 is happy now again [17:48:58] we're not causing an outage with this, AFAIK [17:48:59] bblack: my maintenance is now stopped, let me check amir's [17:49:06] [yet] [17:49:14] it is not bad to have people around just in case [17:49:17] just internal toil [17:49:20] elukey: I'll take it, you can leave [17:49:27] Yes. [17:49:31] Ack [17:49:33] let others debug, try to keep track [17:49:53] bblack: well we need some coordination, etcd went down on a node and pybal suffered because of it, we are close to causing one in my opinion :) [17:50:37] bblack: stopped also all Amir's maintenance but one thread (which is currently executing a long alter table) but all the pooling/repooling is now stopped [17:50:42] apergos: is it doable to stop the run? [17:50:55] are the pybal connections to etcd alerts expected to clear on their own? e.g. PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 12 connections established with conf1007.eqiad.wmnet:4001 (min=73) [17:51:01] I know it is a pain but at the moment it seems a quick way to decrease the pressure [17:51:28] elukey: I guess but it will pick back up in a few hours [17:51:37] we'll lose whatever it's in the middle of [17:51:40] herron: I'm not 100% sure. some will, maybe. we might need restarts, the pybal+etcd integration isn't known for being amazing [17:51:54] mutante: <3 [17:51:56] looking at conf1009 to prevent disk space running out [17:51:58] kk that's what I was wondering essentially if a bounce would be necessary for those to recover [17:52:14] so...
the culprit IMHO is "wikimedia/multi-http-client v1.0" [17:52:26] yep [17:52:29] -rw-r----- 1 www-data www-data 25G Nov 3 17:52 etcd_access.log [17:52:29] -rw-r----- 1 www-data www-data 20G Nov 3 17:38 etcd_access.log.1 [17:52:41] On phone e [17:52:46] I truncated etcd_access.log.1 earlier on vgutierrez [17:52:57] on all three nodes [17:53:06] (it was around 35G) [17:53:08] Amir1: No worries, I went to your screens and stopped everything that was doing pools/depools, just leave one of them which is still running the big alter [17:53:16] right, the first alert of all this was the confd1007 disk space alert, that's kinda where it all started and the first thing that was mostly-mitigated in the realtime sense [17:53:21] apergos: yes let's stop it if possible, at least we'll know if it is the run or not causing this [17:53:22] so we got 55G of access log in conf1009 [17:53:30] and snapshot1010 is going at it like a maniac [17:54:01] vgutierrez: not only 1010, there are 4 snapshot nodes with tons of TIME_WAIT sockets (like thousands) [17:54:03] but who is accessing etcd? [17:54:04] I'll check it once I'm back [17:54:07] or what? [17:54:19] snapshot10[10,13,12,11] [17:54:20] jynus: something from snapshot instances using UA wikimedia/multi-http-client 1.0 [17:54:31] oh sh*t, I gotta go now, sorry folks [17:54:42] oh, I thought that was accessing dumps [17:54:49] I'm leaving 1008 alone, as it does different sorts of runs [17:54:50] understood [17:54:54] Hmm. I think the reload is put in the wrong for loop [17:54:58] apergos: ack thanks [17:55:06] it's fetching GET /v2/keys/conftool/v1/mediawiki-config/?recursive=true [17:55:33] so potential bugs on mw /dump process? [17:55:50] Amir1: got a link? [17:56:02] I need to look at the logs to be sure [17:56:12] bblack: one sec [17:56:42] they should all be gone now [17:57:14] it makes sense at least, it links dumps and etcd (mw config) [17:57:15] apergos: I see some recovery [17:57:20] Here's the document: https://docs.google.com/document/d/1gq8yOn_d8PhEyjNzMytBvPR39596x8cl4g_4AfwNiQ8/edit?usp=sharing [17:57:27] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=conf1009&var-datasource=thanos&var-cluster=etcd&from=now-30m&to=now [17:57:30] confirmed on snapshot1013 the process is gone. still seeing a ton of connections from there to all conf hosts, all in TIME_WAIT and all tcp6 [17:57:30] apergos: --^ [17:57:50] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/798678/13/includes/export/WikiExporter.php [17:57:54] ah, and now it stopped [17:57:54] I think the doc will be useful to organize all ideas [17:58:09] even if not a proper outage [17:58:10] the connection attempts to conf hosts stopped on snapshot1013 [17:58:20] yep so it was what apergos killed :) [17:58:22] yeah [17:58:29] I was going to say XD [17:58:44] yes [17:58:58] apergos: so it was " python3 ./worker.py --configfile /etc/dumps/confs/wikidump.conf.dumps", was it? [17:59:04] also matches the time frame [17:59:14] there are maintenance scripts that are run [17:59:17] so it was those I expect [17:59:18] apergos: how bad is it to interrupt that? [17:59:19] does it want to talk to conf to get dblists?
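Editor's note: how the culprit user agent named above can be confirmed from the access log — a sketch; the awk field position assumes a conventional nginx log format with the client IP first, so adjust to the real format.

    # Top client IPs hitting etcd, then how much of the log is the dumps client:
    sudo awk '{print $1}' /var/log/nginx/etcd_access.log.1 | sort | uniq -c | sort -rn | head
    sudo grep -c 'wikimedia/multi-http-client' /var/log/nginx/etcd_access.log.1
    sudo grep -c '/v2/keys/conftool/v1/mediawiki-config/?recursive=true' /var/log/nginx/etcd_access.log.1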
[17:59:51] example: [17:59:55] /usr/bin/php7.4 /srv/mediawiki/multiversion/MWScript.php dumpBackup.php --wiki=hewiki --dbgroupdefault=dump --full --stub --report=1000 --output=file:/mnt/dumpsdata/xmldatadumps/temp/h/hewiki/hewiki-20221101-stub-meta-history5.xml.gz.inprog_tmp --output=file:/mnt/dumpsdata/xmldatadumps/temp/h/hewiki/hewiki-20221101-stub-meta-current5.xml.gz.inprog_tmp --filter=latest --output=file:/mnt/dumpsdata/xmldatadumps/temp/h/hewiki/hewiki-2 [17:59:55] 0221101-stub-articles5.xml.gz.inprog_tmp --filter=latest --filter=notalk --filter=namespace:!NS_USER --skip-header --start=1466624 --skip-footer --end 1476386 [17:59:58] (sorry) [18:00:24] nothing in the python scripts changed since the last run on the 20th [18:00:28] I assume we don't log etcd reads, it has to be execessive writes right? [18:00:38] (to fill the logs) [18:01:13] yeah, most likly (I am guessing) the logic on dump taking was too overzealous when reloading db configuration [18:01:28] context: https://phabricator.wikimedia.org/T298485 [18:01:42] The last read does reload in processing every row [18:01:51] ups [18:02:06] I will make a patch for that once back [18:02:33] https://phabricator.wikimedia.org/T322156 [18:02:51] Amir1: when you told me you wanted config to be reloaded more frequent, I didn't think you needed it so frequently! :-P [18:03:04] um. yeah :-) [18:03:08] bblack: it's GET /v2/keys/conftool/v1/mediawiki-config/?recursive=true [18:03:11] at least that is an easy fix then [18:03:21] apergos: to be fair, with that bug, backups would had taken too long, at least for medium to large wikis [18:03:27] jynus: I didn't write that part 🥲 [18:03:31] s/backups/dumps/ [18:03:45] Amir1: in any case, I was joking! [18:03:58] as a follow up, we should probably monitor something like https://grafana.wikimedia.org/d/tTE9nvdMk/etcd?orgId=1&from=now-7d&to=now&viewPanel=4 [18:04:04] right now they are semi-broken but that's a side issue, Amir is looking at a fix for that too [18:04:29] I am going to help fill in the incident doc, even if impact was mostly internal [18:05:10] do we need to delete more logs? [18:05:26] in 2 or maybe it is 3 hours, the evening attempt to retry will kick in, so it would be good to have the fix backported and around at that point [18:05:31] or I will need to shoot them all again [18:05:45] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/852220/ [18:05:46] jynus: we should be good, logrotation will gzip the .1 and gain even more space [18:05:53] cool [18:05:56] ^ this is the possible fix? [18:06:03] that cool is for elukey [18:06:06] no that can't be it [18:06:13] I got lost int he related tickets [18:06:16] do we worry about the CPU saturation alerts on LVS? [18:06:26] bblack: I think that is for T322156 [18:06:27] T322156: New errors during this month's full dump run: LoadBalancer.php: No server with index '4' - https://phabricator.wikimedia.org/T322156 [18:06:36] yeah [18:06:39] this is another not yet tracked issue [18:06:52] I will open a ticket to track it [18:07:04] ok [18:08:21] there are more things to clean up, in addition to dumps + patch [18:08:27] as db maintenance was stopped [18:08:47] I am going afk folks, have a good rest of the day :) [18:09:57] As IC I declare this incident stopped but not fully resolved. [18:10:17] +1 [18:10:25] Thanks to everyone for their help. 
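Editor's note: a small follow-up check that the flood really stopped once the dumps run was killed — a sketch to run on any conf host, using the log path quoted above.

    # Matching requests per minute on the live log; should be near zero now.
    sudo sh -c "timeout 60 tail -n0 -f /var/log/nginx/etcd_access.log" | grep -c 'mediawiki-config/?recursive=true'
    # And confirm disk headroom after the truncation and the logrotate gzip:
    df -h /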
[18:10:37] denisse|m: sorry, I was bold to change it in the doc witout telling you [18:10:48] my fault [18:11:41] thanks denisse|m :) [18:12:28] I'll be pingable for awhile yet, but I wo't be actively following this channel, as it seems there's nothing I can do directly at the moment [18:12:42] please do ping if that changes or anyone needs more info about the dumps side of things [18:13:03] I am creating the ticket, apergos [18:13:15] go ahead and add me if you don't mind [18:13:20] conf hosts are technically owned by service ops or traffic? [18:13:28] just to add relevant tags [18:13:36] I will add core team too [18:13:56] service ops I think [18:14:02] apergos: what is your team tag? [18:14:03] I barely know how they work anyways :) [18:14:11] yeah, it is more of a technicality [18:15:14] uh... this doesn't go on their team inbox, jus tput dumps-generation on it [18:15:16] jynus: nowadays you can grep it in the puppet repo for many roles. code says serviceops, yea [18:15:31] apergos: doing, sorry [18:15:34] that way Hannah will see it. and I'll subscribe in any case [18:17:05] https://phabricator.wikimedia.org/T322360 [18:17:14] It doesn't seem to me this incident had any secret stuff and it was all handled in the public channel so.. you might as well decide to make it the "quick incident report" instead of the full one and do it on wikitech right away [18:17:30] might save some work vs first doing a nice doc and then wikitech as well [18:17:55] the state of our icinga alerts is pretty bad lately in general, there's always a ton of outstanding alerts [18:17:56] not 1005, 1013 I think [18:18:00] oh [18:18:01] (re: task description) [18:18:02] makes it hard to see where the real problems are :P [18:18:11] fixing and adding more info [18:18:27] I am mostly worried to track all pending stuff (a patch, stopped maintenance, etc) [18:18:38] yea, imho that's because icinga does not email people, and a couple people who kept looking at web UI have stopped doing so [18:20:27] constantly having unhandled CRITs has always been an issue but it tends to, including those that have 'disabled notifications' but are still in 'unhandled' and not downtimed [18:22:13] * volans back, catching up with backlog [18:22:44] so we have one "CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff" that is pending [18:23:06] but I am sure that is known at DBA [18:23:21] maintenance for dbs was stopped, probably related [18:23:34] the other 28 active alerts are not related to this incident as far as I can see [18:24:03] but it is important because not sure how maintenance will behave if it restarts with that uncommited [18:24:34] jynus: yea, it fits exactly when that was done. 35 to 45 m ago [18:25:16] let me see, I will be a bit safer if it is discarded [18:25:48] if someone can do dbctl config diff [18:25:59] I am on it, marostegui don't worry"! [18:26:02] and paste it here I can check it (I'm on my phone now) [18:26:24] jynus: :) [18:26:32] it was the increase of db1144 's load [18:26:41] "db1144:3315": 75, "db1144:3315": 100, [18:26:44] I am guessing it finished in the middle of a repool [18:26:49] "db1144:3315": 150, "db1144:3315": 200, [18:26:51] and then it is safe to commit yeah [18:26:56] doing [18:27:04] thanks! 
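Editor's note: the pending-dbctl check being asked for here, spelled out; these are the same diff/commit subcommands that appear in the !log entry just below, run from a cumin host.

    # Show the uncommitted configuration left behind by the stopped maintenance...
    sudo dbctl config diff
    # ...and, once the diff is confirmed intentional (the db1144:3315 load bump), commit it:
    sudo dbctl config commit -m 'increase db1144:3315 load'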
[18:27:51] apergos: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/852990 [18:28:17] !log jynus@cumin1001 dbctl commit (dc=all): 'increase db1144:3315 load', diff saved to https://phabricator.wikimedia.org/P38086 and previous config saved to /var/cache/conftool/dbconfig/20221103-182750-jynus.json [18:28:27] ^ marostegui [18:28:32] one less thing to clean up [18:28:38] I just got home, made the patch, I go afk for dinner but keep my phone with me [18:28:52] when do we reload? [18:29:06] thanks. rescheduled icinga check. <+icinga-wm> RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK [18:29:45] apergos: after db queries [18:29:56] seems ok to me [18:30:08] should I +1? who will merge/backport it? [18:30:12] thanks jynus [18:30:28] I will backport it, you can hit +2, in mw you should do that instead of +1 ;) [18:30:46] I have not tested nor have the confidence to +2 it though [18:30:56] is there anything I can help with? [18:31:13] funnily enough after processing the rows, it does reload the config (check for calls of outputPageStreamBatch) [18:31:22] checked disk space on conf* one more time and it's fine now. 83% max [18:31:23] jynus: I linked the MW patch to your ticket to keep the breadcrumbs going [18:31:42] so log has not been growing again since it was truncated and then the dumps were stopped [18:31:43] +1, that was the point of the ticket! [18:31:49] need to go, will return later [18:32:08] volans: I think it's just about documenting what happened. Denisse made a doc for it [18:32:14] volans: yeah, +2 the thing https://gerrit.wikimedia.org/r/c/mediawiki/core/+/798678/13/includes/export/WikiExporter.php if you feel confident about it :-P [18:32:16] it started today at 08:09, see graphs BTW [18:32:28] apergos: I have zero context on that one :D [18:32:37] apergos: then can you find someone who can? I can't self merge [18:32:39] jynus: No need to say sorry, thanks for helping me with the doc! :D [18:32:53] Daniel K. wrote it, please ping him [18:32:56] It's my first time on call so any help is greatly appreciated!! <3 [18:33:32] Amir1: which patch do you need reviewed? [18:33:42] volans: there is https://phabricator.wikimedia.org/T322360 meanwhile [18:33:43] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/798678/13/includes/export/WikiExporter.php [18:33:58] taavi: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/852990 [18:33:59] I'd +1 but not +2, that's the extent of my comfort [18:34:22] ah sorry I am linking the wrong thing in the meantime [18:35:17] duesen: ping? [18:35:44] bblack: in a meeting... what's up? 
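Editor's note: once the fix above is backported and deployed, a quick spot-check that all snapshot hosts carry the same (patched) WikiExporter.php before the evening retry kicks in — an illustrative sketch; the php-* glob is an assumption about the MediaWiki layout on those hosts.

    for h in snapshot1010 snapshot1011 snapshot1012 snapshot1013; do
      ssh "$h.eqiad.wmnet" 'md5sum /srv/mediawiki/php-*/includes/export/WikiExporter.php'
    done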
[18:35:54] duesen: yo, shit's on fire [18:35:57] Amir1: +2'd [18:36:07] duesen: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/852990 [18:36:10] taavi: Thanks [18:38:12] I see where the confusion came from on adding the reload, the variable name is $lastRow but that is actually "previousRow" [18:40:17] hm while we're in here I wonder if we can get a review of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/852220/ from anyone (this is needed because stubs dumps for a number of large wikis are broken without it, but it is tangential to the immediate fire) [18:40:49] apergos: actually, it might also fix that issue, can't say for sure but will see [18:40:55] (the patch) [18:42:12] we'll see, as you say [19:02:07] thanks for the backport and deploy [19:06:27] we are out of the woods, I'll go eat dinner [19:06:41] please do and thanks again [19:07:02] You're welcome 😊 [19:20:07] * duesen is wondering if he broke anything [21:00:50] Are the wcqs pybal alerts expected? [21:01:13] inflatador: ^ [21:02:16] cc: ryankemper [21:02:47] RhinosF1: thanks, looking now [21:03:03] RhinosF1: Could it be related to this issue? https://phabricator.wikimedia.org/T322360 [21:03:05] ryankemper: np [21:03:06] this usually happens when an LVS service is added or removed [21:03:22] denisse|m: that's why I asked, it's just gone off though [21:03:24] then it might need coordination with -traffic and pybal restarts [21:03:31] And that's to my understanding mitigated [21:03:47] 20:57:36 PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.67:443]) https://wikitech.wikimedia.org/wiki/PyBal [21:04:03] .67 should be wcqs eqiad [21:04:14] it sounds more like it's due to some hosts decoms or new installs [21:04:34] https://github.com/wikimedia/puppet/blob/production/hieradata/common/service.yaml#L2567 [21:05:10] yeah blazegraph looks happy on `wcqs1001` for example, but all of `wcqs100[1-3]` are depooled in pybal [21:05:25] We're not doing any changes to our lvs config, seeing if anything recent in #operations might explain it [21:05:25] there are docs for this specific error: https://wikitech.wikimedia.org/wiki/PyBal#Services_known_to_PyBal_but_not_to_IPVS [21:05:40] nice, thanks! [21:05:50] starting with "This alert is usually temporary " [21:06:07] I'm also on wcqs1001 still, will try to re-pool [21:06:27] either it is indeed temporary or if it's not you might need pybal restart [21:06:52] Ah, I think I see what happened here. This likely doesn't need a pybal restart [21:06:56] in that case I would maybe ask -traffic [21:07:02] ok, cool [21:07:05] inflatador: we did the data xfer from wcqs1003->wcqs1001, correct? [21:07:07] ^^ yeah, this is related to our data xfer cookbook [21:07:18] indeed, and it looks like 1002 never got repooled either [21:07:24] just pooled 1001 and 1002 [21:07:36] we should check codfw, they should all be pooled AFAIK [21:07:48] denisse|m: yea, unrelated for sure [21:07:48] (update: they are) [21:08:53] those alerts can take some time to realize a change [21:09:23] it was our fault, all the hosts in the DC got depooled, there should have been at least one left [21:09:28] mutante: it recovered [21:09:38] We're back to normal now [21:09:40] RhinosF1: ah, cool, other channel of course :) [21:09:41] Thanks to RhinosF1 for checking IP 10.2.2.67 and getting back to us [21:09:52] inflatador: no problem [21:10:00] Glad we caught it [21:10:06] Would it have impacted anyone? 
[21:10:28] May have slowed ppl down as only CODFW was pooled [21:10:41] but we have no SLA for WCQS yet, it's still in beta [21:10:53] Any WCQS requests routed to eqiad would have failed during the window (I assume like ~10ish minutes) [21:11:29] inflatador: btw normally pybal refuses to depool a host if would bring it below the configured desired threshold. However 1001/1003 were depooled from us running the `sre.wdqs.data-transfer` cookbook, so perhaps the way it depools gets around that limitation? [21:11:33] Is it mini report worthy then to see if anything can be done to avoid future accidental depooling of all backends in a DC? [21:11:54] would be nice if those errors displayed the VIP DNS name [21:12:00] instead of IP [21:12:40] RhinosF1 I think it falls under this existing ticket https://phabricator.wikimedia.org/T321605 , but feel free to open another if you prefer [21:13:00] ryankemper: I think you are onto something there. there is definitely some lower limit and it would alert in different ways when it gets under that [21:13:23] but also if you only have 1 or 2 servers per DC it's kind of hard to not hit that [21:13:28] the behavior is specific to the query service playbooks, which we're already working on cleaning up [21:14:02] yea, good idea, inflatador, the alert should do the DNS lookup [21:14:17] inflatador: FWIW RhinosF1 was asking about a lightweight incident report, not a phab ticket [21:14:50] ryankemper: yes [21:15:05] Ah yeah, I would say no b/c there's no SLA/SLO but I defer to RhinosF1 or anyone else w/more experience on how to handle incidents [21:15:51] inflatador: id argue there's a follow up though to prevent future repeats [21:16:10] And it wasn't actually responded to because of SRE catching the alerts there [21:16:15] It was a nosey annoying user [21:16:22] I guess it becomes an incident if users actually noticed this but not otherwise. [21:16:27] RhinosF1: :P [21:17:10] I think I'd be in favor of just creating a phab ticket for the DNS lookup (if it's not feasible the ticket could be closed out), and adding some investigation into the depool logic of our `sre.wdqs.data-transfer` cookbook as an acceptance criteria of https://phabricator.wikimedia.org/T321605 [21:17:26] I could go either way though [21:17:27] In that case, I'd say that using the DNS name for the VIP instead of IP should be part of any follow-up [21:17:49] I'll make the DNS name request, guessing Observability would own that? [21:18:03] ryankemper: tickets suit me [21:18:13] As long as it's been properly learned from [21:18:23] inflatador: probably add #traffic too [21:18:46] Probably should be one about warning if too much is depooled too when the action being carried out [21:18:56] RhinosF1: cool, we'll go that route in the interest of minimizing overhead [21:19:17] ryankemper: can you add me onto any created or link them here [21:19:37] RhinosF1: Will do [21:19:52] inflatador: if you want to open the dns lookup ticket I'll add the blurb to the data transfer cookbook ticket [21:23:57] ACK [21:28:14] RhinosF1: Added you as a subscriber to https://phabricator.wikimedia.org/T321605. The bit about investigating the depooling is the final bullet of the acceptance criteria in the ticket desc [21:30:35] Thanks [21:32:04] And here's the ticket for DNS name instead of IP: https://phabricator.wikimedia.org/T322377
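Editor's note: the two follow-ups from this thread, sketched — mapping the bare VIP in the alert back to a service name (the ask in T322377) and re-checking pool state for the wcqs backends; the conftool selector and the expected svc hostname are assumptions to verify.

    # Reverse-resolve the VIP the alert printed:
    dig +short -x 10.2.2.67        # expected: something like wcqs.svc.eqiad.wmnet.
    # Pool state for the wcqs backends in both DCs (from a cumin/conftool host):
    sudo confctl select 'cluster=wcqs' get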