[09:03:29] ContentUpdatesDisabled sounds good but the doc also mentions wgDisableSearchUpdate, so it's not entirely obvious what's the diff between the two just reading the names
[11:07:32] lunch
[14:11:13] o/
[16:00:36] \o
[16:03:29] o/
[16:04:29] cloudelastic consumer looking happy enough, no obvious complaints or backlogging
[16:04:38] ( ^_^)o自自o(^_^ ) CHEERS!
[16:05:01] :)
[16:05:51] nice!
[16:06:53] did another run through of the saneitizer logs, nothing terrible. It did remind me that we need to get the saneitizer going though, the errors that come out are still the pre-Feb 12 deletion failures and they don't get auto-fixed
[16:07:09] i suppose going to review the open tickets, but if nothing jumps out will start evaluating if a flink-based saneitizer can work via api?
[16:07:36] i suppose the open question is how hard the bit where we load 1000 pages from the sql db is from mw api instead
[16:08:48] i suppose alternatively, we could have a remediator that generates events that the SUP understands i guess
[16:10:32] * ebernhardson wonders if keeping the saneitizer but having it emit events is any better...it's certainly simpler
[16:11:26] yes this is an open question, def simpler to keep the same CheckerJobs thing and have them emit events for the new SUP
[16:11:36] something about that just feels ugly though :P
[16:12:15] indeed, problem is how to feed the list of ids to check
[16:12:31] could it be done offline from a spark job?
[16:12:36] i'm not too worried about that part, i'm pretty sure i can magic up something with a fake flink sink and rocksdb
[16:13:03] s/sink/source/
[16:13:36] hmm, i suppose i'm not a big fan of offline because it doesn't answer the rerendering problem
[16:14:32] a source that would just ship a slow range(1, max_id) over and over?
[16:14:51] yea, when it nears max_id it can re-poll the mw api to find where the end really is
[16:15:58] so we'd need to just expose the checker via API
[16:16:24] it sounds reasonably simple.
Given a start date, an end date, and a max_id that should be emitted, figure out the max_id that should have been emitted up until now, then emit the ids between the max that has already been emitted and that point
[16:16:59] hmm, i suppose i was thinking to re-implement the checking on the SUP side, but maybe we do just expose the checker as an api and let flink handle the id generation
[16:18:01] do we parse when checking?
[16:18:19] no, i would have to double check but i think it's just a page table query
[16:19:15] seems pretty independing from other jobs, do we stuff it into existing flink jobs or create a new one?
[16:19:23] independing/independent
[16:20:13] could be done from the producer I guess?
[16:20:19] i suppose i have been thinking it fits into the consumer. The result of checking is basically an update request
[16:20:33] the problem with the producer is the producer is all-clusters, saneitize is per-cluster
[16:20:53] ah true, three clusters, three checks, indeed
[16:21:04] would it be nice to have events in the middle, if only for debugging?
[16:21:11] basically saneitize->events->consumer
[16:21:12] so we'd not event need a new stream at all
[16:21:19] s/event/even
[16:22:10] hm... a side-output that would just go to hive might be sufficient?
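The id-scheduling idea sketched at [16:16:24] above — spread ids 1..max_id evenly over a time window, and on each wake-up emit only the ids that are newly "due" — could look roughly like the following. This is a hypothetical Python sketch, not the actual SUP/Flink code; `LoopState` and `ids_due` are invented names, and a real Flink source would keep `last_emitted` in checkpointed state.

```python
from dataclasses import dataclass

@dataclass
class LoopState:
    """Persisted source state: the last page id already handed out."""
    last_emitted: int = 0

def ids_due(start_ts: float, end_ts: float, max_id: int, now_ts: float,
            state: LoopState) -> range:
    """Given a window [start_ts, end_ts] over which ids 1..max_id should be
    emitted evenly, return the ids that should have gone out by now_ts but
    have not been emitted yet, and advance the state accordingly."""
    # Fraction of the window elapsed so far, clamped to [0, 1].
    frac = min(max((now_ts - start_ts) / (end_ts - start_ts), 0.0), 1.0)
    target = int(max_id * frac)  # max id that should be out by now
    due = range(state.last_emitted + 1, target + 1)
    state.last_emitted = max(state.last_emitted, target)
    return due
```

For example, halfway through a window covering max_id=1000, ids 1..500 are due; a later call returns only the delta since then. As noted at [16:14:51], a real loop would re-poll the mw api for the true max page id as it nears the end of a pass, then start the next pass.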
[16:22:17] yea, probably would
[16:23:05] the source will need a state, hopefully should not be too hard
[16:23:41] i guess i didn't check too closely, but yea i've been assuming a flink source has easy access to state :) If not, some stupid hack like a source that emits true once per second, then a rich flat map that has easy state access
[16:24:28] a state at the source sounds cleaner, should not be hard I'm sure
[16:24:39] this source must be disabled on backfills
[16:25:04] perhaps enabled by a default-off property would work
[16:25:30] event-time will be "now" but the consumer logic does nothing based on event-time IIRC so should just work
[16:25:33] i guess the default doesn't matter, backfill inherits config, it will have to be turned off regardless
[16:26:01] seems reasonable
[16:27:10] tuning the throughput limitations might be the sole "difficult" part I guess
[16:28:41] hm... thinking about it, it'll need to know the list of wikis?
[16:29:35] hmm, yea i suppose it will. Can query site matrix?
[16:30:16] yes I think so?
[16:33:16] or noc.wikimedia.org, dunno
[16:34:23] oh, yea that might work too
[16:35:45] first stab at a ticket: https://phabricator.wikimedia.org/T358599
[16:37:47] sounds good to me
[17:06:21] huh, it turns out ApiPageSet is already querying min and max page_id for the purpose of filtering out-of-range inputs. But i suspect we're better off skipping the query api and feeding ids directly
[17:46:35] workout/lunch, back in ~1h
[18:31:42] dinner
[19:06:47] back
[19:59:10] * ebernhardson realizes we haven't thought about the archive index
[20:09:55] back from netsplit
[22:13:54] inflatador: welcome back to the other side :) okay, it looks like we may be able to get access to a high speed nvme in our data center. https://phabricator.wikimedia.org/T352253 . i'll try to fill in some initial details, but would it be possible to discuss a bit more tomorrow morning?
[22:14:03] ryankemper: ^
[22:14:50] dr0ptp4kt: yup!
[22:15:16] dr0ptp4kt Y, send me an invite.
Not sure if you have access to netbox but we should be able to get most of what DC OPs is asking from there
[22:15:48] https://netbox.wikimedia.org/dcim/devices/1871/
[22:20:39] i do have access, thank you for checking inflatador . make that tomorrow afternoon and tomorrow morning for you two, respectively. i'm trying to squeeze in meetings with chris and miriam tomorrow during the usual weekly meeting, so will just schedule as one-off
[22:34:52] ryankemper getting some weirdness on prom1006...puppet showed a change when I ran it, but it doesn't look like it actually changed anything based on the modified date of the file
[22:36:17] inflatador: what about the contents of the file itself? does it look like https://puppet-compiler.wmflabs.org/output/1006992/3189/prometheus1006.eqiad.wmnet/index.html?
[22:37:58] no. I think it has to do with me disabling puppet on the wdqs hosts first. I did enable it on a single host, but I'm going to try enabling on wdqs1011 (since it was explicitly listed in the puppet diff on prom1006) and running puppet on prom1006 and wdqs1011 again
[22:50:00] nm, looks like it just took a while to show up. Predictably, it's alerting
[22:52:58] looks like a bad regex
[23:11:25] ryankemper one more CR to fix the regex, hopefully the last ;) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1007006
[23:11:40] I tested it with https://regexr.com/ this time
[23:12:49] inflatador: +1
[23:15:19] ACK thanks again
[23:32:21] hmm, might've just worked https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes%2Fcustom&var-module=All&orgId=1&from=now-15m&to=now
[23:37:40] or not...we had one success, then back to broken?
[23:37:43] (╯°□°)╯︵ ┻━┻
[23:41:51] I enabled puppet on one more host.
Probably won't do anything, but this is my last shot before rolling back
[23:48:59] hmm, one of the two alerts resolved
[23:53:11] the host I just added (wdqs2008) never went into alert, and 2007 cleared
[23:54:50] wdqs1011 is still alerting, but based on my curl ( https://paste.opendev.org/show/btzdiAKw7zVu3J7FxwzR/ ) wdqs1011 responds just fine
[23:55:06] Maybe because it's a test host, it doesn't have ferm rules for prometheus hosts? hmm
[23:56:33] nope, that's not it
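Returning to the "list of wikis" question from earlier in the day ([16:28:41]–[16:30:16]): MediaWiki's `action=sitematrix` API returns wikis grouped by language under numeric string keys, plus a `specials` list for non-language wikis. A rough Python sketch of flattening such a response into dbnames, assuming that general response shape (the sample payload below is abbreviated and invented for illustration; real responses carry more fields per site):

```python
def flatten_sitematrix(resp: dict) -> list[str]:
    """Flatten an action=sitematrix response into a list of dbnames.

    Language groups live under numeric string keys ("0", "1", ...) each
    carrying a "site" list; non-language wikis live under "specials"."""
    matrix = resp["sitematrix"]
    dbnames = []
    for key, group in matrix.items():
        if key.isdigit():  # skips "count" and "specials"
            dbnames.extend(site["dbname"] for site in group.get("site", []))
    dbnames.extend(site["dbname"] for site in matrix.get("specials", []))
    return dbnames

# Abbreviated, made-up example of the response shape:
sample = {
    "sitematrix": {
        "count": 3,
        "0": {"code": "en", "name": "English", "site": [
            {"url": "https://en.wikipedia.org", "dbname": "enwiki", "code": "wiki"},
            {"url": "https://en.wiktionary.org", "dbname": "enwiktionary", "code": "wiktionary"},
        ]},
        "specials": [
            {"url": "https://commons.wikimedia.org", "dbname": "commonswiki", "code": "commons"},
        ],
    }
}
```

The same list could presumably also be derived from the dblists published on noc.wikimedia.org, as mentioned at [16:33:16].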