[06:13:18] _joe_: sirenbot is down
[06:30:20] <_joe_> RhinosF1: thanks, someone will take care of that
[06:30:29] <_joe_> I guess all the netsplits yesterday didn't help
[07:29:58] !incidents
[07:29:59] No incidents occurred in the past 24 hours for team SRE
[07:30:27] that's the bot service restarted; I don't know if any further fettling is required
[07:51:13] Krinkle: makes sense, I'll try to flip the flag and figure out the proper call sites to update
[08:11:10] good morning! today is the codfw row D switch upgrade (the last one); some teams' tables are not fully completed yet: Core-platform (cc hnowlan, urandom), IF (cc moritzm), ServiceOps (cc akosiaris)
[08:28:20] XioNoX: I've updated the remaining IF hosts; the only two left are puppetmaster2002 (the depool needs to happen shortly before the maintenance starts) and ping2003. For the latter, could you already redirect ICMP on the routers to ping[13]003?
[08:29:00] moritzm: thx, I'll take care of it closer to the maintenance
[08:29:57] ack
[08:30:46] We'll depool closer to the maintenance window
[09:07:14] Krinkle: re all PO access going through PoolCounter - we can make it optional and turn it off by default. But why? Is it substantial overhead? Shouldn't all paths that lead to parsing have stampede protection?
[09:25:53] related to ^ I've updated the patch to include the opt-in approach; please feel free to choose whatever option you consider best, and possibly revert to PS2, which keeps the old behavior by default
[09:26:19] XioNoX: ack, updated. Will depool closer to the time
[09:26:39] thx!
[09:53:11] <_joe_> duesen: definitely yes
[10:52:43] XioNoX: codfw depooled. Ready for the maintenance
[12:16:49] duesen: stampede protection is nice I guess, but the problem is what happens under a supposed stampede. Apart from page views, PO access needs to give what you asked for or fail, not silently return yesterday's PO again to a job that's carrying final responsibility for updating a thing; same for CLI. These processes are already protected by being isolated in hardware and throttled in throughput
[12:22:13] moritzm: ping offload disabled
[12:23:47] <_joe_> Krinkle: uhm, sorry, what's PO in this context?
[12:24:28] <_joe_> I think in general parsing attempts should all be under poolcounter - what to do if you can't get a lock might vary
[12:25:01] <_joe_> so in your case, if poolcounter won't grant a lock, it should fail
[12:27:46] upgrade staging done, relocating (hopefully before the bridge between here and home is raised)
[12:34:12] _joe_: ParserOutput objects. HTML plus metadata.
[12:37:24] Krinkle: as far as I can tell, neither PoolWorkArticleViewCurrent nor ParserOutputAccess serves stale cached content; neither passes the $useOutdated flag to ParserCache::get
[12:38:58] _joe_: well, just saying that up until the refactor last year we had never used PC for jobs or maint scripts, and I don't think it was moved there intentionally. It seems almost accidental, due to merging two generic-looking 'getParserOutput' methods (from Article/WikiPage and from ParserCache/ContentHandler) into one.
[12:40:29] oh I see, the fallback() method calls getDirty(). We may want a flag to control that as well.
[12:40:34] but yeah, it probably doesn't hurt, but it also seems kind of pointless. The probability of lock contention there seems ~0 and very hard to hit. There's no direct user control. It's a single-threaded CLI job iterating over some batch, or a job responding to some one-time event for a known revision/page that was written to after its own limitations.
[12:41:55] for that to coincide with something else parsing, and then benefiting in real time from fetching it from the parser cache, I don't think it improves throughput to any measurable degree. I suppose we can try to empirically confirm that. To me it's an optimisation without proof that adds non-trivial runtime complexity.
[12:42:09] <_joe_> Krinkle: oh so it's for jobs, sorry I completely missed the context. Yes, it's not what poolcounter should defend against in general
[12:42:35] We could actually track that -- we are passing a "render reason" through, and we could log the reason when hitting a poolcounter lock
[12:42:39] <_joe_> jobs run at a fixed concurrency anyways
[12:43:17] a lot of code paths can hit getParserOutput, and several of them are shared between jobs and web-based access.
[12:43:24] I can see an argument for the inverse as well: having only one way to access ParserCache simplifies things for you as the maintainer of that code. The tricky bit is that we then end up with "one way", but it's really "one way with 5 boolean parameters"
[12:43:41] We could have a flag explicitly disabling poolcounter, or disallowing stale output.
[12:43:58] <_joe_> yeah, I'd avoid creating a footgun the next developers could trigger
[12:44:23] <_joe_> either you have a static method that allows disabling the use of poolcounter, or you use it everywhere
[12:44:25] The way it used to work is: 1) ParserCache get, with set on miss - simple, no special handling; 2) the article view path through PoolCounter, which uses ParserCache + poolcounter + speculative links update + saves to ParserCache
[12:44:36] so these three options are always used together.
[12:44:52] Krinkle: I prefer a single place for the implementation, especially to have proper handling of the interactions between these flags. But we can have several entry point methods with nice names.
[12:45:10] XioNoX: ack. I'll stop Puppet in codfw and the edges in ~5m (for puppetmaster2002)
[12:45:37] cool
[12:52:30] XioNoX: swift/thanos frontends done
[12:53:13] depooling ores
[12:53:42] hnowlan: are maps/restbase depooled?
[12:53:55] I can do it if needed
[12:54:12] XioNoX: puppet disabled, all IF hosts good to go
[12:54:28] inflatador, ryankemper, good for the elastic hosts?
[12:54:33] moritzm: thx
[12:54:59] maps/restbase/elastic are the last ones afaik
[12:55:18] XioNoX: indeed
[12:55:43] awesome
[12:57:39] the maps/restbase hosts from the task are currently all still pooled (e.g. "sudo -i confctl select name=restbase2012.codfw.wmnet get" on pm1001), so I think you can depool them
[12:57:46] yeah
[12:57:49] doing now
[12:58:34] `sudo cumin 'P{P:netbox::host%location ~ "D.*codfw" and restbase* }' depool`
[12:58:36] same for maps
[12:58:49] done
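A rough editorial sketch (not from the log) of how the depooled state above could be double-checked before the reboot and undone afterwards; the `pool` command is assumed to mirror the `depool` wrapper used above:

```bash
# Confirm the depool took effect (run wherever confctl is available, e.g. pm1001
# as mentioned above); repeat for the maps hosts from the task.
sudo -i confctl select name=restbase2012.codfw.wmnet get

# After the maintenance, the repool at [13:20] below presumably just mirrors the
# depool command, i.e. 'pool' instead of 'depool' on the same cumin selection:
sudo cumin 'P{P:netbox::host%location ~ "D.*codfw" and restbase* }' pool
```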
[13:00:01] alright, going to start the switch reboot in 1 min unless anyone speaks up
[13:00:46] 👍
[13:01:12] let's go!
[13:01:31] argh, haven't depooled the ores machines yet
[13:01:34] "System going down in 1 minute"
[13:01:36] klausman: I did
[13:01:39] thanks!
[13:01:45] sorry for being late to the party
[13:02:01] cool kids are always late to parties
[13:02:36] it's rebooting
[13:03:07] something going on with cumin1001?
[13:03:30] works fine for me
[13:03:35] inflatador: wfm
[13:03:40] are you maybe using a codfw bastion?
[13:03:52] klausman: Ah! that's gotta be it
[13:04:24] ohhh, the new bast host is in row D
[13:04:32] and when I generated the list it was still 2002
[13:05:24] np, I can wait a few min
[13:10:46] first 3 switches up
[13:11:02] waiting for the other 5
[13:12:53] we should start seeing recoveries
[13:14:03] all switches reporting alive
[13:14:34] annnnd bast2003 is back
[13:14:49] all interfaces are up
[13:15:38] can confirm all ml machines ping
[13:16:38] everything is good network-wise
[13:17:58] hm, the bastion might still be down. can't SSH to any of the machines
[13:18:47] bast2003 works fine for me
[13:18:51] looks like a local problem (my machine)
[13:18:54] works for me as well
[13:19:13] so ok to restart paused services, as far as you can see?
[13:19:33] SSH working now for me
[13:19:53] XioNoX: I can pool the ores machines, unless you insist :)
[13:20:21] klausman: go for it :)
[13:20:23] I'm going to re-enable Puppet in codfw/esams/ulsfo now
[13:20:28] Device rebooted after 5 years 330 days 20 hours 6 minutes 51 seconds
[13:20:41] yeah, you can repool services
[13:20:59] repooling restbase
[13:21:06] repooling DNS in ~10 mins or so then
[13:21:10] ores machines pooled. ml-k8s clusters already look fine
[13:21:17] repooling maps
[13:22:01] I'll repool all services on the datacenter level in ~30m
[13:22:17] XioNoX: that's nice and stable
[13:22:30] yeah, that was very smooth!
[13:23:11] Nearly 6 years of uptime is impressive too
[13:23:23] Most of my tech hasn't lasted 6 years
[13:23:26] In a slightly scary way :)
[13:23:52] yeah, we can complain about Juniper, but it's still good gear, and it's still running
[13:25:47] ORES looking good (requests show up and throw no errors)
[13:54:37] pooling codfw
[16:33:55] uhm.. so the puppet code for rsync::server::module seems to have a bug. both the auto_ferm and auto_ferm_ipv6 parameters are set to true.. yet.. the firewall rules are in iptables but NOT in ip6tables.. so it doesn't work when IPv6 is tried first.. is this new? I feel like I should have run into this before
[16:36:07] the snippet in /etc/ferm/conf.d/ for _ipv6 is created.. but the rules are not in ip6tables -L .... also not after a ferm restart.. wut
[16:37:28] well, but inside the snippet for v6 there is.. "@resolve((10.64.16.105 10.64.32.184 " heh
[16:37:50] IPs in hiera instead of host names? goes looking for that :)
[16:41:32] well, we have the gitlab-runners in Hiera.. listed like this:
[16:41:42] - '10.64.32.184' # gitlab-runner1003
[16:42:18] not going to work well to resolve that for the v6 part.. ack..
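An editorial aside on the diagnosis above: ferm's @resolve() does DNS lookups, so an IPv4 literal copied from Hiera has nothing to resolve on the AAAA/ip6tables side, which matches the empty v6 rules observed. A rough way to see it (host and port details are illustrative):

```bash
# A hostname has an AAAA record, so the _ipv6 snippet gets a usable rule:
dig +short AAAA gitlab-runner1003.eqiad.wmnet

# A bare IPv4 literal resolves to nothing for AAAA, so the v6 rule ends up empty:
dig +short AAAA 10.64.32.184

# After switching the Hiera entries to hostnames (assumed fix), the rsync rule
# (port 873, opened by rsync::server::module) should show up in ip6tables too:
sudo ip6tables -L -n | grep 873
```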
[17:12:14] mutante, arnoldokoth: just a heads-up: I've depooled the old-thumbor nodes via confctl so that we're running only on thumbor-k8s. If thumbor makes any noise (probes failing most likely, it'll only be in eqiad) you can add more capacity by running `sudo confctl select service=thumbor,name=thumbor100[1256].eqiad.wmnet set/pooled=yes`.
[17:12:28] but most likely they will survive, after us bumping the resources
[17:38:26] hnowlan: Thanks for the heads up.
[17:38:28] hnowlan: ok! though would it be one host at a time or all 4 at once? well, keeping an eye out
[17:39:22] mutante: if it gets to the point of failing, all 4 at once will be the quickest way to fix things. technically one or two might suffice, but it's probably not worth the hassle
[17:43:02] hnowlan: ACK
[17:43:39] brett: ok to merge your patch?
[17:43:45] pybal lvs change in drmrs
[17:44:43] brett: I merged; since we don't restart pybal automatically, it should be fine, but please note :)
[17:45:08] (note that it was merged)
[17:55:01] I had disabled puppet on the drmrs lvs instances, so they're not applied yet
[17:55:15] and now I'm just figuring out my SSH config for drmrs. It's always drmrs :P
[17:55:32] I have to run to an appointment but will continue after that
[18:02:48] ok thanks
[18:52:39] howdy, we (fr-tech) just ran across an issue where we didn't keep up with prod hardware changes and need to update our NTP configs for the new dns servers in codfw. we're wondering if there is an anycast name/address we could use, similar to recdns.anycast.wmnet, to avoid this in the future. does anyone know if there is one?
[18:52:57] sukhe: ^
[18:53:36] dwisehaupt: there is ntp.codfw.wikimedia.org, which as of this morning should point to dns2004
[18:53:39] does that help?
[18:54:22] well, partially. since we include this in the PFW config, we need to include IP addresses.
[18:57:06] for recdns we are able to use 10.3.0.1 where we can't use the domain name.
[18:57:22] sorry, back; was decommissioning
[18:57:28] np.
[18:57:29] yes, I don't think there is an equivalent
[18:57:57] but we can certainly create one. I will think about it and follow up. do you have a ticket? please assign me to it if you do
[18:58:16] would recdns work, since (in theory) the dns servers are also the ntp servers? not sure if there would ever be a split of the services.
[18:58:16] I am not sure how or where but I can look into that and discuss with bblac.k
[18:58:29] yeah, in theory it should, unless I am missing some other corner case
[18:59:45] ok cool. i just have a ticket for updating our config for now. i'll think over the use case more and open a ticket if we think it would be necessary for an anycast setup.
[19:00:00] but that may be more work than just us updating the config every few years. :)
[19:00:21] thanks!
[19:00:21] so far, all dns servers are also ntp servers, so in theory 10.3.0.1 should point you to an ntp server too, like you said
[19:00:33] however, maybe I am missing something, so better to check, since it is DNS :)
[19:01:12] yeah, the DNS box work is rare but it happens. also, we periodically do remove a few servers from the DNS pool, so there's that
[19:01:21] dwisehaupt: please add me to it and I will follow up
[19:01:55] will do. thanks for the info.
[19:02:53] ---
[19:03:18] we completed some DNS work in codfw today, dns200[4-6] being the new DNS and ns1 hosts
[19:03:24] if you see some issues, please ping me
[19:03:33] we should be fine but just in case
[19:03:34] thanks
[19:07:57] :) nice
[20:50:54] youyou
[20:51:08] (ignore that, wrong window)
[23:06:41] ^ alright, bot worked. great. I am off-call now.
[23:07:09] we did have a minor incident with the MX server queue that is resolved.
[23:07:25] report will be published tomorrow. no action items open
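An editorial sketch for the [18:58] question above, checking from a production host whether the recdns anycast address also answers NTP; this only verifies the "in theory it should" assumption discussed there:

```bash
# The anycast VIP answers recursive DNS queries...
dig +short @10.3.0.1 ntp.codfw.wikimedia.org

# ...and, if the DNS hosts do double as NTP servers, it should answer NTP too
# (-q only queries; it does not step the clock):
ntpdate -q 10.3.0.1
```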