[00:00:02] jhathaway: yeah I don't either, something for us both to look into on Monday
[00:00:04] GenNotability: thanks
[00:00:19] GenNotability: +1
[00:00:19] rzl: recommend reconsidering the impact of the UA filter
[00:00:26] have we figured out what proxy service this is?
[00:00:28] it broke things, but..
[00:00:41] GenNotability: no, doesn't appear to be a proxy :/
[00:00:59] least none of the usual ones
[00:01:18] TheresNoTime: yeah I'm starting a patch for it -- I don't want to roll back out the one we used last time, but I should be able to block everything except oauth traffic
[00:01:45] not sure if we'll end up using it but I want to have it ready
[00:01:47] but with the wide range of ISP & usage types zombies seem more likely
[00:02:02] TheresNoTime: spur doesn't flag anything consistently, but I really would like to figure out what we're looking at - unidentified proxy service, botnet, ???
[00:02:16] * GenNotability fires up Shodan
[00:02:38] I saw a few residentials I think, I assumed botnet
[00:02:54] yeah mix of residential, corp, and hosting
[00:03:00] mostly residential
[00:03:36] I guess the question is "what's the point" - as DDoS attacks go this is...wimpy
[00:03:45] People are weird?
[00:03:47] (great, now I've cursed us all)
[00:05:30] ...I really want to know why my private global test filter is looking for the phrase "stab an elephant with a nostril"
[00:05:43] ohhhh, that LTA
[00:05:47] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2019.codfw.wmnet with reason: host reimage
[00:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:03] the original stuff felt very cartoon villain -- someone scorned by a database taking out their wrath on us
[00:06:10] but now it's just nonsense
[00:06:56] It feels like on the level of some of the more low-impact vandalism that's usually seen from younger folk so I'd say "scriptkiddie" but that doesn't explain a botnet
[00:08:31] setting up an iot botnet is sadly not /that/ difficult
[00:08:44] most of the Mirai folks were college students
[00:09:04] or even high school ones
[00:09:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2019.codfw.wmnet with reason: host reimage
[00:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:18] how can I help?
[00:11:06] my experience of these types of botnet attacks (in other contexts) is that they tend to not be persistent. either they'll burn through their network pool, or they'll get bored eventually, usually a few days to a week
[00:11:39] it's somewhat a waste of zombies tbh
[00:13:41] skids with entirely-too-large botnets ain't exactly new either
[00:14:00] grabbed enwiki filter 1185, log-only as global filter 297
[00:14:16] sadly global filters can't help with most of the large wikis
[00:14:16] AntiComposite: no, but they tend to get bored
[00:15:00] GenNotability: you also didn't set it as global :P
[00:15:13] TheresNoTime: shaddup
[00:15:18] if we need to opt-in some large wikis into global filters on an emergency basis I think we should just do it
[00:15:21] zabe: not trying to block, just track
[00:15:27] legoktm: +1
[00:15:59] how can I help?
[00:16:13] I'm still catching up, does frwiki still need help?
[00:16:34] I think they implemented an abusefilter, but that is being throttled, so I guess yes
[00:16:41] https://fr.wikipedia.org/wiki/Sp%C3%A9cial:Journal_du_filtre_antiabus?wpSearchUser=&wpSearchPeriodStart=&wpSearchPeriodEnd=&wpSearchTitle=&wpSearchImpact=0&wpSearchAction=any&wpSearchActionTaken=&wpSearchGroup=0&wpSearchFilter=29 no further hits
[00:16:42] zabe: just trying to keep an eye out for the botnet jumping to a less-patrolled wiki
[00:16:50] turns out, the throttling does not disable blocking
[00:16:57] just gives annoying notification spam
[00:17:05] legoktm: I can monitor performance of systems
[00:17:09] ah okay, we can disable throttling on frwiki
[00:17:11] we probably should skip wikidata
[00:17:15] GenNotability: fair enough, the big wikis usually have active reporters
[00:17:17] should we just disable af throttling globally?
[00:17:29] what can I do for wikidata (WD admin)?
[00:17:33] rzl: no, I think we concluded that throttling isn't an issue here
[00:17:33] ^ rephrase, I'm inclined to do that but listening for objections :)
[00:17:47] the throttle doesn't stop the filters so it doesn't seem worth disabling
[00:17:48] +1
[00:17:51] throttling only stops "restricted actions", of which disallow isn't affected
[00:17:53] I'm also admin if needed but I'm saying global filters should not be enabled for wikidata
[00:18:03] legoktm: oh got it, I misunderstood above, thanks
[00:18:10] I got it wrong last night too
[00:18:10] the throttling "just" sends echo/emails
[00:18:13] I've been watching swviewer for the last 15 mins and nothing has popped up yet 🤷‍♂️
[00:18:16] which, isn't fun
[00:18:23] this started on wikidata, but the filter would need to be different anyway
[00:18:46] AntiComposite: sigh
[00:18:49] let me see
[00:18:52] looks like frwiki is still blocking at least https://fr.wikipedia.org/wiki/Sp%C3%A9cial:Journal/block
[00:19:06] do you have an example?
[00:19:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2019.codfw.wmnet with OS bullseye
[00:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:12] SRE, ops-codfw, DC-Ops, serviceops, Kubernetes: (Need By: TBD) rack/setup/install kubernetes20[19|2(012)] - https://phabricator.wikimedia.org/T299470 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2019.codfw.wmnet with OS bullseye comp...
[00:19:24] Amir1 on frwiki it's like this https://fr.wikipedia.org/wiki/Spécial:Journal_du_filtre_antiabus/3277035
[00:19:26] also, frwiki has global filters enabled
[00:19:54] frwiki admins are blocking old edits. there's nothing ongoing in RC
[00:20:08] perryprog: I see.
[00:20:17] rephrase, they're blocking accounts after the attack stopped*
[00:20:40] do we have a stew also gblocking the IPs? have they been reusing IPs across wikis?
[00:21:22] I randomly looked at a few of the frwikis and didn't see any that were blocked on enwiki (or that hit abusefilter)
[00:21:24] I hate to say it but should we ban IP editing in those wikis for a short period of time?
[00:21:24] there has been some gblocking, but I haven't seen much reuse
[00:21:26] also fwiw, i saw minimal to no ip reuse among not-blocked IPs from the attack yesterday to the one today.
[00:21:44] Amir1: you're not going to hear me objecting.
[00:21:44] Amir1, wmgEmergencyCaptcha is slightly less drastic and has been effective
[00:22:44] there's also still the option of blocking the user agent
[00:22:45] we did that in fawiki a while ago (for a week) when some vandalism ended up in national tv and everyone learned they could vandalize wikipedia, fun times
[00:22:57] but yeah, captcha is good
[00:23:05] it's deployed, what else?
[00:23:11] new wikis needed?
[00:23:59] It doesn't seem like there's an active attack on any wiki at the moment, from what I can tell.
[00:24:18] https://meta.wikimedia.org/w/index.php?title=Special:AbuseLog&wpSearchFilter=297 is empty, so yeah, looks good so far
[00:24:42] btw, I don't see anything for wikidata so far https://www.wikidata.org/wiki/Special:RecentChanges?userExpLevel=unregistered&hidebots=1&hidecategorization=1&limit=50&days=30&enhanced=1&urlversion=2
[00:25:41] yeah haven't seen anything on wd today
[00:27:53] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2020.codfw.wmnet with OS bullseye
[00:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:28:00] SRE, ops-codfw, DC-Ops, serviceops, Kubernetes: (Need By: TBD) rack/setup/install kubernetes20[19|2(012)] - https://phabricator.wikimedia.org/T299470 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2020.codfw.wmnet with OS bullseye
[00:29:45] GenNotability: did you end up setting up the global tracking filter?
[00:30:08] legoktm: yup, https://meta.wikimedia.org/wiki/Special:AbuseFilter/297 - straight-up rip of enwiki 1185, log-only
[00:30:17] have musikbot set to ping me on hits
[00:32:26] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[00:33:00] (JobUnavailable) firing: Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[00:33:37] awesome
[00:34:10] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[00:36:48] (PS1) RLazarus: varnish: Block a bad UA [puppet] - https://gerrit.wikimedia.org/r/763867 (https://phabricator.wikimedia.org/T302047)
[00:37:42] I'm mostly around this weekend and next week (with exception of Monday), ping me if you need anything deployed
[00:38:59] (CR) Ladsgroup: [C: +1] varnish: Block a bad UA [puppet] - https://gerrit.wikimedia.org/r/763867 (https://phabricator.wikimedia.org/T302047) (owner: RLazarus)
[00:40:16] (CR) CDanis: [C: +1] varnish: Block a bad UA [puppet] - https://gerrit.wikimedia.org/r/763867 (https://phabricator.wikimedia.org/T302047) (owner: RLazarus)
[00:41:30] ugh, that filter does not work at all on wikidata...
[00:43:41] uh
[00:43:43] damn
[00:43:45] yup, if you give me some vandalism examples, I can build a new one for wikidata
[00:44:01] I think they're back and very xwiki, stand by one
[00:44:22] https://fr.wikipedia.org/wiki/Colon?diff=prev&oldid=190984836
[00:44:22] https://sl.wikipedia.org/wiki/Sulili?diff=prev&oldid=5652266
[00:44:22] https://lt.wikipedia.org/wiki/Belgijos_futbolo_var%C5%BEybos_1955%E2%80%931956_m.?diff=prev&oldid=6502999
[00:44:23] (PS2) RLazarus: varnish: Block a bad UA [puppet] - https://gerrit.wikimedia.org/r/763867 (https://phabricator.wikimedia.org/T302047)
[00:44:23] @GenNotability global emergency captcha?
[00:44:27] is captcha on?
[00:44:33] yes, back xwiki - https://logstash.wikimedia.org/goto/2dafc22a9a031caa6fb1c3bc8c1b65a4
[00:44:36] Confirmed, they're xwiki
[00:44:54] I'm going to deploy the captcha now
[00:44:57] Seddon: might be needed
[00:45:07] wait, can we do the UA block first please?
[00:45:16] (CR) Legoktm: [C: +1] varnish: Block a bad UA [puppet] - https://gerrit.wikimedia.org/r/763867 (https://phabricator.wikimedia.org/T302047) (owner: RLazarus)
[00:45:21] or at least, one at a time
[00:45:28] (CR) Zabe: [C: +1] "lgtm now" [puppet] - https://gerrit.wikimedia.org/r/763867 (https://phabricator.wikimedia.org/T302047) (owner: RLazarus)
[00:45:34] one at a time sounds good to me, I'm awaiting varnish tests atm
[00:45:54] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2020.codfw.wmnet with reason: host reimage
[00:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:01] GenNotability: is the FP rate low enough to set the global filter to disallow?
[00:46:10] legoktm: uncertain, had one FP on WD
[00:46:17] legoktm: I make the patch, keep it stand by
[00:46:28] GenNotability: set to warn
[00:46:38] that tends to trip up automated things
[00:46:41] TheresNoTime: can I also set phasers to stun?
[00:46:47] (worked on enwp)
[00:46:48] beat me to that joke
[00:47:41] (PS1) Ladsgroup: Enable wmgEmergencyCaptcha everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/763870
[00:47:52] new type? https://sk.wikipedia.org/w/index.php?title=Saitama_(Saitama)&diff=prev&oldid=7321289&diffmode=source
[00:47:55] GenNotability ^
[00:48:11] I'm seeing it on many many wikis
[00:48:12] variation
[00:48:16] (CR) Seddon: [C: +1] Enable wmgEmergencyCaptcha everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/763870 (owner: Ladsgroup)
[00:48:18] ffffffff
[00:48:20] Amir1: I'd rather try the UA block first and then captcha everywhere, if we can
[00:48:27] given how hostile the captcha is to users in some alphabets
[00:48:27] lemme flag down a steward real quick to see if they've changed UAs
[00:48:35] very very many wikis
[00:48:43] rzl: as I said above "I make the patch, keep it stand by"
[00:48:46] ah thanks
[00:48:59] how are people following/seeing this in real time?
[00:49:00] varnish tests done, merging the UA patch now unless any objections?
[00:49:07] es, hr, sq, sl, be
[00:49:09] :shipit:
[00:49:11] legoktm https://swviewer.toolforge.org
[00:49:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2020.codfw.wmnet with reason: host reimage
[00:49:15] yep got
[00:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:18] go*
[00:49:18] none from me
[00:49:22] (CR) RLazarus: [C: +2] varnish: Block a bad UA [puppet] - https://gerrit.wikimedia.org/r/763867 (https://phabricator.wikimedia.org/T302047) (owner: RLazarus)
[00:49:23] legoktm: https://logstash.wikimedia.org/goto/2dafc22a9a031caa6fb1c3bc8c1b65a4 for me
[00:49:44] for me I have a script to easily find other wikis that an IP has edited in the last 24 hours - once seen on one wiki, check others. https://meta.wikimedia.org/wiki/User:DannyS712/FindIPActivity.js
[00:50:00] GenNotability should Q105103969 be added to the filter?
[00:50:00] see https://global-search.toolforge.org/?q=%22Banish+%5B%5Bd%3AQ105103969%7CVerified+Handles%5D%5D%22&namespaces=&title=
[00:50:09] working on it...
[00:50:32] I'm trying to jam two different filters together here
[00:50:32] rzl: for checking collateral damage, this would be useful https://grafana.wikimedia.org/d/000000170/wikidata-edits?orgId=1&refresh=1m&from=now-3h&to=now
[00:50:41] Wikidata uses a lot of OAuth editing
[00:50:53] oauth should not be affected
[00:51:03] "should" ;)
[00:51:05] GenNotability let me know how I can help, I'll stay out of editing the filters to avoid conflicts unless you want a hand
[00:51:16] fair point ;)
[00:53:07] * perryprog is no longer seeing it
[00:56:07] thank you whoever already cleaned up the "banish" edits
[00:56:31] it's back, nvm https://sq.wikipedia.org/w/index.php?title=Amos_Oz&diff=2399780&oldid=2368340&uselang=en&redirect=no
[00:57:32] Much larger volume this time. GenNotability, check filter?
[00:57:58] perryprog: working on it, boss
[00:58:59] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2020.codfw.wmnet with OS bullseye
[00:59:01] deployed - irlike failure (left the q capitalized)
[00:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:59:05] SRE, ops-codfw, DC-Ops, serviceops, Kubernetes: (Need By: TBD) rack/setup/install kubernetes20[19|2(012)] - https://phabricator.wikimedia.org/T299470 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2020.codfw.wmnet with OS bullseye comp...
[00:59:59] suggest blocking the qid addition entirely
[01:00:24] I hope the filter is private
[01:00:31] still seeing it GenNotability https://bg.wikipedia.org/w/index.php?diff=11311591&oldid=11182660&uselang=en&redirect=no&mobileaction=toggle_view_desktop
[01:00:43] Should end soon
[01:01:11] What does "Banish Verified Handles" even mean?
[01:01:23] there's a database called Verified Handles
[01:01:24] it's...some database?
[01:01:33] aah
[01:01:37] apparently they wanted to be included in it, but were rejected
[01:01:38] I thought it's the blue tick
[01:01:47] lovely
[01:01:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2021.codfw.wmnet with OS bullseye
[01:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:01:54] SRE, ops-codfw, DC-Ops, serviceops, Kubernetes: (Need By: TBD) rack/setup/install kubernetes20[19|2(012)] - https://phabricator.wikimedia.org/T299470 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2021.codfw.wmnet with OS bullseye
[01:02:21] GenNotability I'm creating a different filter to just block that qid
[01:02:26] DannyS712: go for it
[01:02:52] DannyS712: set it to warn, not disallow :)
[01:02:59] that seems to be enough*
[01:03:01] the UA ban is now on all varnish hosts
[01:03:43] if that doesn't cause the edits to drop off sharply I'd love to know :)
[01:04:37] checking Amir1's link and it looks like the correct squigglies to me
[01:04:46] (wrt collateral damage)
[01:05:20] still happening e.g., https://sl.wikipedia.org/w/index.php?title=USS_Reid_(DD-292)&diff=5652307&oldid=5446212&uselang=en&redirect=no
[01:05:47] concur
[01:06:23] good to know, digging in webrequests to see if the UA changed to something else or if my VCL just failed to catch it
[01:06:39] steward around to CU?
[01:06:43] that'd be easier
[01:06:48] yes please :)
[01:07:03] rzl: bad regex i think
[01:07:05] created 299 to warn for that addition
[01:07:51] seeing a lull
[01:08:07] concur
[01:08:17] rzl: the regex needs + for UA
[01:08:20] right?
[01:08:24] ughhhhh
[01:08:25] thanks Amir1
[01:08:48] I guess I could do a cu in an extreme emergency, but not sure whether that applies here
[01:08:51] though someone beat me to it with disallow, https://meta.wikimedia.org/wiki/Special:AbuseFilter/298
[01:08:55] (what flavour of regex is this?)
[01:09:06] global-search gives currently 322 live versions of the edit https://global-search.toolforge.org/?q=%22Banish+%5B%5Bd%3AQ105103969%7CVerified+Handles%5D%5D%22&namespaces=&title=
[01:09:27] I'll go clean those up
[01:09:36] it's back as https://es.wikipedia.org/w/index.php?diff=141775055&oldid=139481646&uselang=en&redirect=no&mobileaction=toggle_view_desktop
[01:09:42] perryprog: how to split it?
[01:09:43] PCRE2 mind says something like `^python-requests\/.*$` would be better?
[01:09:47] seeing ongoing hits
[01:09:53] (PS1) RLazarus: varnish: Block a bad UA, correctly [puppet] - https://gerrit.wikimedia.org/r/763873 (https://phabricator.wikimedia.org/T302047)
[01:09:59] Amir1 I'm just reverting live edits
[01:10:05] er, recent changes edits
[01:10:18] (CR) Ladsgroup: [C: +1] varnish: Block a bad UA, correctly [puppet] - https://gerrit.wikimedia.org/r/763873 (https://phabricator.wikimedia.org/T302047) (owner: RLazarus)
[01:10:18] TheresNoTime: I considered \W* but I don't want to block a more detailed UA that happens to include python-requests
[01:10:21] (CR) CDanis: [C: +1] varnish: Block a bad UA, correctly [puppet] - https://gerrit.wikimedia.org/r/763873 (https://phabricator.wikimedia.org/T302047) (owner: RLazarus)
[01:10:26] I did check that it's + and not \+ in this dialect though :)
[01:10:27] (CR) Zabe: [C: +1] varnish: Block a bad UA, correctly [puppet] - https://gerrit.wikimedia.org/r/763873 (https://phabricator.wikimedia.org/T302047) (owner: RLazarus)
[01:10:31] rzl: ack
[01:10:37] that's a lot of +1s
[01:10:43] (CR) Jforrester: [C: +1] varnish: Block a bad UA, correctly [puppet] - https://gerrit.wikimedia.org/r/763873 (https://phabricator.wikimedia.org/T302047) (owner: RLazarus)
[01:10:52] heh
[01:10:53] perryprog: do you want me to create a bot to do it?
[01:10:59] (CR) JHathaway: [C: +1] varnish: Block a bad UA, correctly [puppet] - https://gerrit.wikimedia.org/r/763873 (https://phabricator.wikimedia.org/T302047) (owner: RLazarus)
[01:11:09] Amir1 DannyS712 might already be on it :)
[01:11:13] waiting for the tests, then I'll merge and deploy
[01:11:22] someone merge this patch before it breaks +1 record
[01:11:41] ah, makes sense
[01:11:54] tempted to skip the tests but if there's any file where I'm absolutely not gonna do that, it's text-frontent.inc.vcl.erb :)
[01:12:06] *frontend, see??
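Editorial sketch of the regex exchange above (the missing `+`, and why `\W*` or `.*` was considered too broad). The merged VCL pattern is not quoted anywhere in this log, so the three patterns and both sample user agents below are assumptions made up for illustration, and JavaScript's regex engine stands in for the Varnish dialect.

```javascript
// Hypothetical patterns; the pattern actually merged is not shown in the log.
const missingQuantifier = /^python-requests\/[0-9.]$/; // no "+": allows only one version character
const withQuantifier = /^python-requests\/[0-9.]+$/;   // "+": a full version string and nothing after it
const prefixWildcard = /^python-requests\/.*$/;        // the ".*" form suggested in channel

// Made-up sample user agents.
const bareUA = 'python-requests/2.27.1';
const detailedUA = 'python-requests/2.27.1 ExampleBot/1.0 (ops@example.org)';

// Report whether a pattern would block each sample UA.
function check(pattern) {
  return { bare: pattern.test(bareUA), detailed: pattern.test(detailedUA) };
}

const results = {
  missingQuantifier: check(missingQuantifier),
  withQuantifier: check(withQuantifier),
  prefixWildcard: check(prefixWildcard),
};
console.log(results);
```

Under these assumptions, only the quantified and fully anchored form blocks the bare library UA while leaving a longer, self-identifying UA alone, which seems to be the trade-off being weighed in channel.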
[01:12:11] (CR) DannyS712: [C: +1] varnish: Block a bad UA, correctly [puppet] - https://gerrit.wikimedia.org/r/763873 (https://phabricator.wikimedia.org/T302047) (owner: RLazarus)
[01:12:15] testing is overrated
[01:12:17] rzl: you're never going to earn your t-shirt this way
[01:12:20] (can't have vandals if all UAs are blocked!)
[01:12:33] TheresNoTime: now that's the kind of thinking I'm here for
[01:12:40] cdanis: it's reduced to stickers due to budget reasons anyway
[01:12:48] wait I figured it out
[01:12:55] we just have to make it so that *not* anyone can edit it
[01:12:56] Surprised I've only been ratelimited a few times in my reverting
[01:12:58] problem solved
[01:13:03] "the wiki that nobody can edit"
[01:13:10] Wikipedia, the free encyclopedia that cdanis can edit
[01:13:27] True, how could no one come up with that until now ;)
[01:13:30] hey, I can edit by manually entering records into the database too
[01:13:35] good to see you back cdanis :)
[01:13:42] $1 fee per edit also edits are stored on a blockchain
[01:13:57] perryprog: and people complain about maxlag now...
[01:14:07] aaaand I'm unsubscribing from filter notifications
[01:14:11] heh
[01:14:12] suffice it to say we're getting a lot of hits :P
[01:14:20] was wondering how long it would take
[01:14:26] I'm probably missing a lot of wikis too since I just have the smallwikis plus a few extra
[01:14:38] is the global filter in disallow?
[01:14:51] GenNotability: AntiComposite: just put it to warn
[01:15:02] that'll probably disrupt a bot enough
[01:15:10] still happening at e.g., https://es.wikipedia.org/w/index.php?title=Tupanciretã&diff=141775183&oldid=120322084&uselang=en&redirect=no&diffmode=source
[01:15:31] eswiki might be opted out of global filters
[01:15:38] certainly
[01:15:41] tests passed, merging
[01:15:51] (CR) RLazarus: [C: +2] varnish: Block a bad UA, correctly [puppet] - https://gerrit.wikimedia.org/r/763873 (https://phabricator.wikimedia.org/T302047) (owner: RLazarus)
[01:16:01] rollout will be a little speedier this time
[01:16:08] 1234qwer's is in warn
[01:16:50] A while ago I wrote some scripts for globally change js stuff (fix deprecation, etc.) I can reuse most of the code
[01:17:18] the only complexity is that it's not based on the global search it just searches each wiki manually
[01:17:33] well, automatically, I mean one by one
[01:17:47] 6 +1's, not bad. Not sure what the record is.
[01:17:59] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:18:01] we're up to about 500 live instances of the phrase assuming there isn't too much lag at the moment
[01:19:14] should see a drop off on the filter log now..?
[01:19:22] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2021.codfw.wmnet with reason: host reimage
[01:19:25] So far it's looking clear
[01:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:06] no more hits for a minute..
[01:20:17] yep, looks good
[01:21:06] I'm clearing the old ones. From global search, run
[01:21:06] document.querySelectorAll('td a').forEach(
[01:21:06] function ( e ) { e.setAttribute( 'href', e.getAttribute('href') + '?diff=cur&oldid=prev&diffonly=true' ); }
[01:21:06] )
[01:21:06] to change each page link to the last diff for easy revert
[01:21:33] <3
[01:21:56] !log legoktm@deploy1002 Synchronized private/PrivateSettings.php: T302047 (duration: 00m 49s)
[01:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:23:20] varnish rollout complete
[01:23:44] sorry for the extra delay, wish that hadn't taken two tries to get right :)
[01:23:44] rzl: I owe y'all a beer
[01:23:52] rzl: nice work!
[01:24:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[01:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:24:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2021.codfw.wmnet with reason: host reimage
[01:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:25:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[01:25:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[01:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:25:59] https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?viewPanel=2&orgId=1&var-site=codfw&var-site=drmrs&var-site=eqiad&var-site=eqsin&var-site=esams&var-site=ulsfo&var-cache_type=varnish-text&var-status_type=4&var-method=POST&from=now-3h&to=now&refresh=30s
[01:26:04] nice spike in 403s
[01:26:23] dumb question: does the magic UA filter have enough granularity to block just the edit API?
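Editorial sketch relating to the console snippet pasted at 01:21:06: that one-liner appends a literal `?diff=cur&oldid=prev&diffonly=true`, which presumably works only while the result links carry no query string of their own. A slightly more defensive variant, assuming the links are absolute URLs, might look like the following. The helper name `toLatestDiffUrl` and the two sample URLs are hypothetical; only the three query parameters come from the log.

```javascript
// Hypothetical helper: point a page link at its most recent diff.
// Uses the standard URL API so an existing query string is extended, not broken.
function toLatestDiffUrl(href) {
  const url = new URL(href);
  url.searchParams.set('diff', 'cur');
  url.searchParams.set('oldid', 'prev');
  url.searchParams.set('diffonly', 'true');
  return url.toString();
}

// Sample inputs (assumed link shapes, not taken from the global-search page).
const plainLink = toLatestDiffUrl('https://sq.wikipedia.org/wiki/Amos_Oz');
const queryLink = toLatestDiffUrl('https://sq.wikipedia.org/w/index.php?title=Amos_Oz');
console.log(plainLink);
console.log(queryLink);
```

In a browser console this could be applied with something like `document.querySelectorAll('td a').forEach(a => { a.href = toLatestDiffUrl(a.href); })`, reusing the `td a` selector from the original snippet.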
[01:26:38] AntiComposite: nice spike, indeed
[01:26:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[01:27:02] GenNotability: it does not
[01:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:27:13] dang
[01:27:26] well there was my great idea for the day, have fun kids!
[01:27:42] it allows oauth, which should fix most false positives
[01:28:01] at least the ones that have been complained about
[01:28:53] GenNotability: see what I just posted on the phab ticket
[01:28:54] If this doesn't work permanently, and say they work around captcha... what's the next logical deterrent after that
[01:29:54] UA again?
[01:30:03] Battle of attrition?
[01:30:03] Seddon: fairly sure the servers have off switches /s
[01:30:13] Seddon: WMF black ops team
[01:30:37] perryprog: definitely has limitations
[01:30:55] options depend on details I can't see
[01:31:34] (details of their requests)
[01:32:51] but restricting anon API access is an option, if relevant
[01:33:16] !log legoktm@deploy1002 Synchronized private/PrivateSettings.php: T302047 tweaks (duration: 00m 48s)
[01:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:34:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2021.codfw.wmnet with OS bullseye
[01:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:34:50] SRE, ops-codfw, DC-Ops, serviceops, Kubernetes: (Need By: TBD) rack/setup/install kubernetes20[19|2(012)] - https://phabricator.wikimedia.org/T299470 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2021.codfw.wmnet with OS bullseye comp...
[01:36:03] Anyone that speaks dutch might be able to reply here https://de.wikipedia.org/wiki/Wikipedia:Administratoren/Notizen?diff=cur&oldid=prev&diffonly=true&diffmode=source
[01:36:25] perryprog: das ist deutsch
[01:36:31] o/
[01:36:54] I was thinking of responding but zabe definitely can do better than mine
[01:37:03] what should we respond?
[01:37:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[01:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:37:26] Ack, yes, German not Dutch
[01:37:52] i was planning something like "it was a botnet, employees of the wmf are working on it", maybe linking the gerrit change for the techies?
[01:38:00] (JobUnavailable) firing: (2) Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:38:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[01:38:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[01:38:16] well, I learned an important lesson today: when you set up a system that notifies you of every DoS event, it can be very difficult to turn that system back off
[01:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:38:36] haha
[01:38:53] Problem: Be DDoSd. Solution: Set up notification system in case of DDoS. Problem: Be DDoSd.
[01:39:06] zabe: botnet distributed vandalism across wikis?
[01:39:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[01:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:39:18] GenNotability: that's me looking at my pager
[01:39:31] yeah sound fair
[01:40:22] (JobUnavailable) firing: (2) Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:40:28] And the backlog of "Banish [[d:Q105103969|Verified Handles]]" is officially at one! https://global-search.toolforge.org/?namespaces=&q=%22Banish%20%5B%5Bd%3AQ105103969%7CVerified%20Handles%5D%5D%22&title=&purge=1
[01:40:30] global search now clear (except for the dewiki discussion)
[01:40:33] Thanks everyone for y'all's help.
[01:40:43] ^^
[01:40:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2022.codfw.wmnet with OS bullseye
[01:40:49] 👍
[01:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:40:51] I go rest, I have a presentation tomorrow
[01:40:54] SRE, ops-codfw, DC-Ops, serviceops, Kubernetes: (Need By: TBD) rack/setup/install kubernetes20[19|2(012)] - https://phabricator.wikimedia.org/T299470 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2022.codfw.wmnet with OS bullseye
[01:41:07] what mitigations are we leaving in place over the weekend, and what do we want to think about rolling back?
[01:41:12] I'll be around tomorrow ping me in IRC if needed
[01:41:29] rzl: CAPTCHA should be rolled back in my opinion as soon as possible
[01:41:32] e.g. with the UA block in place do we want to think about switching emergency captchas off?
[01:41:34] ye
[01:41:35] https://meta.wikimedia.org/wiki/Special:AbuseLog/1578207
[01:41:42] +1 to switching emergencyCaptcha off
[01:41:57] and the throttling turned back on
[01:42:01] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:42:04] just in case someone makes an edit filter mistake
[01:42:06] DannyS712, what's that showing, most of the folks in here can't see private filters
[01:42:16] any objections to doing that promptly, or would we rather wait a while to see if we're stable?
[01:42:28] "that" = reverting both emergencyCaptcha and AF throttling to normal
[01:42:38] rzl: I think we can do it right now
[01:42:46] AntiComposite https://el.wikipedia.org/wiki/Χρήστης:ΔώραΣτρουμπούκη trying to add the wikidata id
[01:42:51] oh no
[01:42:58] rzl objection - looks like maybe accounts too
[01:43:21] account compromise? https://el.wikipedia.org/wiki/%CE%A7%CF%81%CE%AE%CF%83%CF%84%CE%B7%CF%82:%CE%94%CF%8E%CF%81%CE%B1%CE%A3%CF%84%CF%81%CE%BF%CF%85%CE%BC%CF%80%CE%BF%CF%8D%CE%BA%CE%B7
[01:43:27] the abuse filter is still constantly triggering with the edit attempts from ips too
[01:44:03] DannyS712: I've not seen any IPs recently, just that ^ account
[01:44:21] check global filter 297 hits
[01:44:30] (PS1) Andrew Bogott: nfs-mounts.yaml: move utrs to a project-local nfs server [puppet] - https://gerrit.wikimedia.org/r/763875 (https://phabricator.wikimedia.org/T301280)
[01:44:32] (PS1) Andrew Bogott: nfs-mounts.yaml: move wmde-templates-alpha to a project-local nfs server [puppet] - https://gerrit.wikimedia.org/r/763876 (https://phabricator.wikimedia.org/T301280)
[01:44:44] DannyS712: `01:18, February 19, 2022: User:168.158.155.50` was the last IP?
[01:45:18] still +1 on disabling EmergencyCaptcha, since it won't affect autoconfirmed accounts [01:45:38] oh, musikbot is still reporting the hits to me but its from a while ago, I guess it was backlogged [01:45:44] :) [01:45:49] yeah it'll do that :) [01:45:52] objection withdrawn, but can we check that account? [01:46:17] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:47:18] (03CR) 10Andrew Bogott: [C: 03+2] nfs-mounts.yaml: move utrs to a project-local nfs server [puppet] - 10https://gerrit.wikimedia.org/r/763875 (https://phabricator.wikimedia.org/T301280) (owner: 10Andrew Bogott) [01:47:51] pretty sure I've done more edits in the last few hours than in the previous few months :) [01:48:06] lol [01:48:11] me too, by a lot [01:48:16] DannyS712: have Operator873 taking a look [01:48:29] I did 0, apparently I'm useless [01:48:29] I already had someone think I was an xwiki abuser on eswiki [01:48:46] Sorry :( I don't know what the backlog problem is with the bot. But otherwise pleased to hear it was of assistance with this incident! 
[01:49:22] was locked as compromised [01:49:23] Also seems like swviewer isn't reporting edits anymore so that might be an issue if the attack starts up again [01:49:26] musikanimal: it looks like it's just clearing out the EF hits at a rate of ~1/sec before acknowledging unsubscribes [01:49:32] * AntiComposite adds "experienced in reverting cross-wiki vandalism" to his SE statement [01:49:52] "history of dealing with major cross-wiki incidents" [01:50:37] (03PS1) 10CDanis: Revert "Revert "Revert "Disable AbuseFilter throttling on enwiki""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763738 (https://phabricator.wikimedia.org/T302047) [01:50:44] (03PS1) 10CDanis: Revert "Revert "Revert "enable wmgEmergencyCaptcha for enwiki""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763739 (https://phabricator.wikimedia.org/T302047) [01:50:50] (03PS2) 10CDanis: Revert "Revert "Revert "Disable AbuseFilter throttling on enwiki""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763738 (https://phabricator.wikimedia.org/T302047) [01:51:25] is it possible to CU the user in elwiki? to see unsuccessful attempts (was it a dictionary attack? etc.) 
[01:51:39] Amir1: done by Operator873 I believe [01:51:39] I think Operator is already doing so [01:51:53] let us know of the details [01:51:57] in the ticket [01:52:49] (03CR) 10CDanis: [C: 03+2] Revert "Revert "Revert "Disable AbuseFilter throttling on enwiki""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763738 (https://phabricator.wikimedia.org/T302047) (owner: 10CDanis) [01:53:33] If CU stuff is going on there I guess it's out of the question to ask to be added to the task :P [01:53:34] (03Merged) 10jenkins-bot: Revert "Revert "Revert "Disable AbuseFilter throttling on enwiki""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763738 (https://phabricator.wikimedia.org/T302047) (owner: 10CDanis) [01:54:05] (03PS2) 10CDanis: Revert "Revert "Revert "enable wmgEmergencyCaptcha for enwiki""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763739 (https://phabricator.wikimedia.org/T302047) [01:54:31] (03PS1) 10Andrew Bogott: nfs-mounts.yaml: move huggle to a project-local nfs server [puppet] - 10https://gerrit.wikimedia.org/r/763878 (https://phabricator.wikimedia.org/T301280) [01:55:31] (03CR) 10CDanis: [C: 03+2] Revert "Revert "Revert "enable wmgEmergencyCaptcha for enwiki""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763739 (https://phabricator.wikimedia.org/T302047) (owner: 10CDanis) [01:55:53] perryprog: are you trustworthy? [01:56:06] Extremely, I only run two sock rings. [01:56:14] hmmmmmm [01:56:19] *only* [01:56:24] nah, I can't trust you [01:56:31] I don't trust anyone who only runs *two* [01:56:38] noob, only two :p [01:56:40] That's not *commitment* to the movement [01:56:41] last I counted I had 7 accounts [01:56:59] (03Merged) 10jenkins-bot: Revert "Revert "Revert "enable wmgEmergencyCaptcha for enwiki""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763739 (https://phabricator.wikimedia.org/T302047) (owner: 10CDanis) [01:57:16] wait is that... 
my head hurts [01:57:22] try not to think about it [01:57:22] Amir1 or someone with shell access, question in _security for you [01:57:32] on it [01:57:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2022.codfw.wmnet with reason: host reimage [01:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:58:08] Oh, TIL I got added to that task [01:58:50] !log cdanis@deploy1002 Synchronized wmf-config/InitialiseSettings.php: disable wmgEmergencyCaptcha and enable AbuseFilter throttling for enwiki aebac8fe1 7618ff941 T302047 (duration: 00m 48s) [01:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:59:22] (03CR) 10Andrew Bogott: [C: 03+2] nfs-mounts.yaml: move wmde-templates-alpha to a project-local nfs server [puppet] - 10https://gerrit.wikimedia.org/r/763876 (https://phabricator.wikimedia.org/T301280) (owner: 10Andrew Bogott) [01:59:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [01:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:00:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:00:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:01:01] you count? 
I'm sure I lost track [02:01:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2022.codfw.wmnet with reason: host reimage [02:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:01:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:05:02] GenNotability if you're still around, maybe time to disable 297? Its getting a bunch of false positives [02:05:31] DannyS712: done [02:06:00] GenNotability: I think you've made a filter mistake there? Put the second part of the check in the `()` and you'll restrict it all back to IPs again? [02:06:07] Amir1 last time I counted was december 2019, so who knows [02:06:13] (or was that by design?) [02:06:17] (03CR) 10Andrew Bogott: [C: 03+2] nfs-mounts.yaml: move huggle to a project-local nfs server [puppet] - 10https://gerrit.wikimedia.org/r/763878 (https://phabricator.wikimedia.org/T301280) (owner: 10Andrew Bogott) [02:06:19] TheresNoTime: I don't even know anymore [02:06:32] I know the feeling :) [02:06:33] that filter was me hastily throwing together two different enwiki filters [02:06:58] and its actually more like 7 + (1/num SUL wikis) because in theory I own https://meta.wikimedia.org/wiki/Special:CentralAuth/DannyS712_(T235446) which only exists on 1 wiki and I could never log into [02:06:59] T235446: Unable to log into ban.wiki - https://phabricator.wikimedia.org/T235446 [02:07:31] a bunch: https://meta.wikimedia.org/wiki/User:Zabe/Alternate_accounts [02:08:39] right on that note, as its 2am, I am going to bed. Many thanks to rzl and the other SREs who took the page. Y'all don't need me :) ciao! 
[02:08:51] sounds good, thanks again TheresNoTime [02:09:10] zabe: public ones don't count [02:09:21] afk too [02:09:38] thanks also all the other steward/admin/CU types who turned up, appreciate all your work <3 [02:09:49] TheresNoTime: thanks indeed, have a good night [02:09:55] the new invention, public sockpuppets [02:10:02] :) [02:10:05] :) o/ [02:10:26] will be around for the next couple hours (playing spaceships) if needed, just ping me [02:10:37] anything else needed here from SRE? I think a bunch of folks are drifting away, just want to make sure there's nothing outstanding [02:10:47] lol, good night (am looking forward to your talk tomorrow Amir1 :p) [02:11:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2022.codfw.wmnet with OS bullseye [02:11:10] night [02:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:11:12] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Kubernetes: (Need By: TBD) rack/setup/install kubernetes20[19|2(012)] - https://phabricator.wikimedia.org/T299470 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2022.codfw.wmnet with OS bullseye comp... [02:11:18] fsck here we go again [02:11:26] https://li.wikipedia.org/w/index.php?title=1800&diff=453123&oldid=439086&diffmode=source [02:11:44] rzl incident doc probably needs to be updated [02:11:48] dang it. 
rzl, GenNotability, Amir1 poke [02:11:50] ehm [02:11:51] AntiComposite: ack [02:11:58] around [02:12:00] aaaaaaaa [02:12:24] stand by, slapping it into the editfilter [02:12:31] There's an INSANE amount [02:12:36] aye [02:12:37] GenNotability: have reenabled [02:12:52] reverted all from that ip [02:13:00] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Kubernetes: (Need By: TBD) rack/setup/install kubernetes20[19|2(012)] - https://phabricator.wikimedia.org/T299470 (10Papaul) [02:13:03] plus will autorevert any new edits from them [02:13:05] slapped into 298 [02:13:15] kicking 297 back into not-false-positive mode now [02:13:30] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Kubernetes: (Need By: TBD) rack/setup/install kubernetes20[19|2(012)] - https://phabricator.wikimedia.org/T299470 (10Papaul) 05Open→03Resolved complete [02:13:40] oh thats no where near enough - please update the PS patch! [02:13:56] GenNotability: have set 298 to private [02:14:00] might want to page? [02:14:01] it wasn't?! [02:14:06] * GenNotability yells at 1234qwer [02:14:15] legoktm [02:14:19] sup? 
[02:14:29] There are over a dozen per second of these [02:14:32] probably more [02:14:41] https://meta.wikimedia.org/w/index.php?title=Special:AbuseLog&wpSearchFilter=297 getting hits [02:14:58] "VH should be internet blacklisted (are you in the clear?)" should be added [02:15:01] to the PS patch [02:15:17] k [02:15:18] _this is why I said that wouldn't be an effective filtering method_ [02:15:21] sorry [02:15:31] no, you're right [02:15:51] DannyS712: y'all should maybe move this to _security [02:15:55] I know, but could have phrased that a lot more politely [02:15:59] There's already over 2,000 additions [02:16:02] logged channel [02:16:03] in less than five minutes [02:16:20] maybe it's time for the emergency captcha everywhere [02:16:20] !log ladsgroup@deploy1002 Synchronized private/PrivateSettings.php: T302047 (duration: 00m 48s) [02:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:16:27] ^ [02:16:31] see if this is helping [02:16:32] ok [02:16:36] I'm seeing a pause but I wouldn't hold your breath [02:16:40] :thinking: [02:16:50] good pause [02:16:54] the 403 rate has dropped as well, may want to reevaluate that mitigation [02:17:13] do we still have a CU? Operator873, you still here? [02:17:15] Amir1: appears to have stopped [02:17:28] I'm around [02:17:28] yeah would love the UA string from that last batch, if a CU can grab it for me [02:17:28] perryprog: what am I, chopped liver? (I assume you meant stew :P) [02:17:32] also a bot is probably good for the backlog [02:17:34] er, yes, stew [02:17:37] diving through logs more slowly in the meantime [02:17:45] * Operator873 waves at perryprog [02:17:48] what are we CU'ing? [02:18:02] let's go with https://ab.wikipedia.org/wiki/Цастәи:Вклад/93.99.11.240 [02:18:03] me! 
[02:18:04] Operator873: https://sat.wikipedia.org/wiki/%E1%B1%B5%E1%B1%AE%E1%B1%B5%E1%B1%B7%E1%B1%9F%E1%B1%A8%E1%B1%A4%E1%B1%AD%E1%B1%9F%E1%B1%B9:89.77.254.201 [02:18:05] the user agent of the bot ips [02:18:05] or that [02:18:05] * AntiComposite gestures broadly [02:18:25] ah never mind, found it [02:18:32] TheresNoTime: if they move to _security I don't get to see things :( [02:18:34] rzl: no longer a Python UA? [02:18:43] Mozilla/5.0 [...] [02:18:43] why is SWViewer giving me other people's undos as things to check! [02:18:46] who is working on the reverting bot? [02:18:59] TheresNoTime: yeah, that was a matter of time but they got there quicker than I was hoping they might :) [02:19:00] DannyS712 disable "Registered" in your filter [02:19:11] TheresNoTime: you know, it's really hard to forge a UA [02:19:16] ;) [02:19:17] lol [02:19:22] thanks [02:19:30] especially when you get hundreds of error messages telling you to check your UA [02:19:46] it has stopped currently [02:19:51] aah [02:20:03] AntiComposite: that hit too hard [02:20:06] Is the filter getting hit? [02:20:13] lemme check [02:20:16] Or did they stop on their own? [02:20:24] `02:16, February 19, 2022: User:89.77.254.201` was the last hit [02:20:29] that checks out [02:20:35] filter's been quiet since...that [02:20:43] the PS block is stopping them [02:20:54] it's coming in at 1 every 3s [02:21:00] https://logstash.wikimedia.org/goto/e25ef9a85ab939c84fa54e4f1f0368b1 [02:21:06] yup [02:21:14] so far I've been reverted once and threatened with a block once cleaning these up. let's see how this wave of reverts goes... [02:21:19] I'm too tired to go through the backlog in swviewer, hopefully someone else will? [02:21:29] Er, PS? [02:21:29] annoyingly i'm filter-restricted on zhwiki [02:21:31] Tamzin threatened by the vandal or by a sysop? 
[02:21:38] perryprog PrivateSettings.php [02:21:40] ah [02:21:44] shh [02:21:48] again, this needs to be automated, your time is too valuable [02:21:48] by someone reporting me erroneously because they misread the diffs. they apologiez [02:21:49] top sekret [02:21:51] *apologized [02:22:58] just sed -i'' s/VH should be internet blacklisted (are you in the clear\?)/g on the database [02:23:04] /g* [02:23:09] bah, you know what I mean [02:24:03] lmao [02:24:09] wcpgw [02:24:38] ugh, filtered again on zh [02:24:58] that started just as I said I was going to bed [02:24:59] who wants to build and run a minimally tested bot to do over two thousand cross wiki edits [02:25:18] probably would be more accurate than human me [02:25:24] ironically, probably [02:25:51] i mean... kinda. but also don't have a proper IDE rn. but... thinking. all it would do is, what, `for diff in list: api_request_to_revert` ? [02:26:32] Tamzin: let me take a stab at it, I wrote a similar thing for global edit interface (fixing js stuff). Just somehow give me a list of pages and wikis [02:26:33] it would have to do checks to make sure the last edit to the page has a diff that exactly matches newline plus "VH should be internet blacklisted (are you in the clear?)" (or just that it contains that) [02:26:39] Amir1: ref your patch and https://logstash.wikimedia.org/goto/e25ef9a85ab939c84fa54e4f1f0368b1, would it take A Lot Of Work (TM) to *block* those IPs too, or is that too much..? [02:26:44] Amir1 https://global-search.toolforge.org/?namespaces=&q=%22VH%20should%20be%20internet%20blacklisted%20%28are%20you%20in%20the%20clear%3F%29%22&title=&purge=1 [02:26:59] perryprog: https://global-search.toolforge.org/?q=%22VH+should+be+internet+blacklisted%22&namespaces=0&title= [02:27:01] https://global-search.toolforge.org/?q=BotnetMaster&namespaces=&title= [02:27:03] TheresNoTime not too much work [02:27:08] new one maybe? 
[02:27:16] https://meta.wikimedia.org/wiki/Special:CentralAuth/BotnetMaster [02:27:26] Amir1: ok. let me know if i can help. `:)` [02:27:34] I'll draft one [02:27:40] perryprog: awesome, give me a couple of minutes [02:27:49] lol AntiComposite surely it can't be that use [02:27:51] easy* [02:27:51] AntiComposite: or someone memeing in an inadvisable manner [02:27:55] Tamzin it can be out of date if there's someone else that reverted [02:27:59] AntiComposite: stand by for CU... [02:28:29] yup, new flood with that [02:28:46] https://rn.wikipedia.org/w/index.php?title=Vyanzo&diff=22012&oldid=18618&uselang=en&redirect=no [02:29:04] yeah, it's just as bad as the last one [02:29:09] legoktm, Amir1 GenNotability [02:29:12] we are officially at whack-a-mole [02:29:17] ayyup [02:29:18] back [02:29:20] *ack [02:29:21] already being deployed [02:29:23] I'm working on smth [02:29:23] starting to rethink emergency captcha everywhere :( [02:29:27] no more entries in logstash, they switched again ... [02:29:31] rzl: I'd concur [02:29:34] !log ladsgroup@deploy1002 Synchronized private/PrivateSettings.php: T302047 (duration: 00m 48s) [02:29:36] welcome to hell [02:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:29:39] the problem is I don't know how long we'd have to leave it on [02:29:41] perryprog: Yeah I just meant as a starting point. Although in theory, doing something like `save(old_text.replace("\nVH should be internet blacklisted (are you in the clear?)", ""))` ought to work, right? it'd just be a null edit if the string doesn't appear [02:29:54] (where `save` is an arbitrary API-calling function) [02:30:19] rzl: or more importantly how long it would give us. But we can't keep this up all weekend. 
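The cleanup approach discussed above (confirm that the latest edit only appended the spam line, then save the text with that line removed) can be sketched roughly as follows. This is a minimal illustration, not the bot that was actually run: `save` is a hypothetical stand-in for whatever API-calling function a real bot would use, and the function names are assumptions made for this example. It skips the save when nothing changed, so no null edit is attempted on pages someone else already reverted.

```python
from typing import Callable

# Spam string quoted in the discussion above.
SPAM = "VH should be internet blacklisted (are you in the clear?)"


def is_pure_spam_append(prev_text: str, cur_text: str, spam: str = SPAM) -> bool:
    """True if the latest revision is exactly the previous one plus newline + spam."""
    return cur_text == prev_text + "\n" + spam


def strip_spam(text: str, spam: str = SPAM) -> str:
    """Remove every occurrence of newline + spam; the text is unchanged if absent."""
    return text.replace("\n" + spam, "")


def clean_page(text: str, save: Callable[[str], None], spam: str = SPAM) -> bool:
    """Save the cleaned text only if it differs; `save` is a hypothetical callback."""
    cleaned = strip_spam(text, spam)
    if cleaned == text:
        return False  # already reverted by someone else, nothing to do
    save(cleaned)
    return True
```

Because search results can be out of date, a real bot would fetch the current page text immediately before calling something like `clean_page`, rather than trusting the list it was handed.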
[02:30:26] Tamzin true [02:30:27] Seddon: yeah agreed on both counts [02:30:44] agreed, global captcha [02:31:03] at least buys us much more time than anything else [02:31:08] AntiComposite: string added to global warn filter [02:31:14] Amir1, legoktm: do you have anything else in-flight or can you switch emergency captcha on? [02:31:17] emergencyCaptcha really gets attractive now [02:31:19] are we seeing IP address reuse, seeing as they're not getting (globally) blocked? [02:31:26] rzl: I can do it [02:31:27] I have something [02:31:31] Amir1: cheers [02:31:35] I wait then [02:32:06] they *cannot* have unlimited IP addresses to burn [02:32:09] TheresNoTime: some reuse but still enough IPs that I'd hesitate to block them all [02:32:10] only 500 additions for the last wave [02:32:23] TheresNoTime you're right, IPv4 isn't that large ;) [02:32:29] TheresNoTime: are we limited in terms of users? Like is pressganging a temp global admin/rollbacker crew useful in any way? [02:32:49] TheresNoTime: Has someone gone back and lengthened all those 31h blocks to something longer? Because I get a feeling if we don't, we'll see repeat customers soon. [02:32:52] Seddon: temp global rollbacker would be useful to give out [02:32:53] (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85% - https://alerts.wikimedia.org [02:32:57] TheresNoTime: just block /0 [02:32:58] I don't think global rollback is needed. I have no global perms and have been fine. [02:33:07] perryprog: wait until IoT gets everywhere, we are doomed [02:33:09] perryprog: would help with some wikis [02:33:12] true [02:33:18] * Tamzin would benefit from temp global rb, just in terms of her workflow, but can also make do without [02:33:50] and would hopefully exempt me from that pesky filter on zhwiki? 
but idk, depends on what's in the filter i guess [02:33:57] legoktm: nice patch, +1 [02:34:05] honestly I've just been skipping zhwiki in my reverts [02:34:08] We would need that call to come from the stewards but just thinking of what options we have before we start escalating [02:34:35] It's impossible to know what their next steps will be. We should start talking to stewards sooner rather than five of these waves later. [02:34:36] Now seems the time to identify all the options we have and get as much in place [02:34:46] Agreed [02:35:18] On the WMF side I think we would also want to start giving more people a heads up about things [02:35:27] Seddon: yeah I was just thinking about that [02:35:51] I'm writing the code [02:36:06] yeah, definitely time to start escalating [02:36:11] Seddon: this needs to escalate to higher management [02:36:12] especially since we're entering the weekend [02:36:21] normally my first point of escalation would be SRE directors but they're likely both asleep -- will wake them up if need be but I don't think this needs them specifically [02:36:53] (and they've asked to be woken up as needed, to be clear -- this is just right on the border of Stuff SRE Owns anyway) [02:36:54] rzl: My thoughts are comms and trust & safety [02:37:02] yeah agree [02:37:04] Seddon: irt stewards I've casually mentioned this in ##stew [02:37:18] comms and/or commrel, not sure which [02:37:25] "yes" [02:37:27] Both [02:37:27] would probably err towards both I guess [02:37:29] when in doubt... [02:38:15] ...nuke it from orbit? [02:38:31] well if we're gonna do that, let's do it first, it saves a lot of other trouble [02:38:36] also musikanimal gets the award for tooling most used in making half of the mitigation we've been doing even possible (globalsearch, musikbot, what else?) 
[02:39:37] rzl: Seddon comms if we go with captcha option [02:39:43] yeah true [02:40:34] TheresNoTime patch on the task for globally blocking the ips for 30 days as open proxies [02:40:54] I would prefer a signoff from a steward on that [02:41:10] 30 days seems like a lot [02:41:11] sysadmin trumps steward [02:41:15] zabe sorry I'm distracted [02:41:18] on what? [02:41:44] on the 30days proxy thing? [02:41:45] DannyS712: would suggest 5 days? [02:41:47] sure, but then I wouldn't take '[[m:NOP|Open proxy]]: See the [[m:WM:OP/H|help page]] if you are affected' as the reason [02:42:16] Operator873 yeah - how long should these be blocked? [02:42:35] Operator873, yes [02:42:41] my default for zombies is usually 3-7 days unless behavior says otherwise [02:42:50] They are clearly anonymizing hosts. 30 days may be a bit much, but it should be long enough to avoid replay. Let's do 1 week [02:42:52] considering some of these ranges are highly dynamic [02:42:57] tallyho, new wave [02:43:05] " will return. That non-profit deserves be deleted off the face of this earth." [02:44:06] DannyS712, is there a reason you changed the regex? [02:44:21] wave is hitting the global filter currently [02:44:29] Just saying again, this channel is publicly logged, yes? [02:44:34] yes [02:44:36] GenNotability: the code I'm writing reverts any edit from a global search result [02:44:41] clean up will be easy [02:44:55] it's been fully blocked so far [02:44:57] @zabe I was just copying from an earlier version [02:45:00] Amir1: nice! though the ones hitting the filter aren't getting through :P [02:45:03] Is there a secure channel this can be coordinated with? 
[02:45:08] +1 [02:45:15] _security, but that's not exactly great for general use [02:45:17] for that lego is working on something [02:45:19] You'll lose a few people if so [02:45:34] as much as I doubt anyone is sitting and reading this, I'd suggest exact filter/regex/etc changes etc are not discussed here [02:45:40] you'll actually lose most in this case [02:45:45] I think that'd be the best method [02:45:52] keep the details to a minimum [02:45:53] make a new channel, invite current participants, lock it down? [02:46:21] i've reserved #wikimedia-bot-attack [02:46:29] !log legoktm@deploy1002 Synchronized private/PrivateSettings.php: (no justification provided) (duration: 00m 37s) [02:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:52:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:53:43] started the revert bot https://an.wikipedia.org/w/index.php?title=Ch&diff=prev&oldid=1811391 [02:53:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:53:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:54:02] Amir1: <3 [02:54:06] in new cases, just give me the search result url [02:55:03] <3 [02:55:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:00:05] !log legoktm@deploy1002 Synchronized private/PrivateSettings.php: (no justification provided) (duration: 00m 48s) [03:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:03:40] !log legoktm@deploy1002 Synchronized 
private/PrivateSettings.php: (no justification provided) (duration: 00m 31s) [03:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:25:27] !log legoktm@deploy1002 Synchronized private/PrivateSettings.php: (no justification provided) (duration: 00m 47s) [03:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:30:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [03:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:31:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [03:31:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [03:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:32:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [03:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:51:08] (03PS1) 10RLazarus: Revert "varnish: Block a bad UA" [puppet] - 10https://gerrit.wikimedia.org/r/763883 (https://phabricator.wikimedia.org/T302047) [03:52:37] (03CR) 10AntiCompositeNumber: [C: 03+1] Revert "varnish: Block a bad UA" [puppet] - 10https://gerrit.wikimedia.org/r/763883 (https://phabricator.wikimedia.org/T302047) (owner: 10RLazarus) [03:56:25] (03CR) 10RLazarus: [C: 03+2] Revert "varnish: Block a bad UA" [puppet] - 10https://gerrit.wikimedia.org/r/763883 (https://phabricator.wikimedia.org/T302047) (owner: 10RLazarus) [04:21:11] PROBLEM - SSH on aqs1007.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:43:00] (JobUnavailable) firing: Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable 
- https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [06:23:41] RECOVERY - SSH on aqs1007.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:32:53] (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85% - https://alerts.wikimedia.org [06:42:41] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:05:31] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:39:21] PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:41:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [07:46:55] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [08:06:51] RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:40:41] RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:40:19] PROBLEM - Host asw1-b13-drmrs.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [09:40:27] PROBLEM - Host netflow6001 is DOWN: PING CRITICAL - Packet loss = 100% [09:40:31] PROBLEM - 
Host ncredir6002 is DOWN: PING CRITICAL - Packet loss = 100% [09:40:35] PROBLEM - Host prometheus6001 is DOWN: PING CRITICAL - Packet loss = 100% [09:42:24] PROBLEM - LVS text drmrs port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv6 #page on text-lb.drmrs.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:42:42] PROBLEM - LVS upload drmrs port 80/tcp - Images and other media- upload.eqiad.wikimedia.org IPv6 #page on upload-lb.drmrs.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:42:51] PROBLEM - Restbase edge drmrs on text-lb.drmrs.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /api/rest_v1/page/summar [09:42:51] } (Get summary from storage) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) timed out before a re [09:42:51] as received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: 
/api/rest_v1/media/math/check/{type} (Mathoid - check test formula) timed out before a [09:42:51] was received https://wikitech.wikimedia.org/wiki/RESTBase [09:43:00] (JobUnavailable) firing: Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [09:43:55] it's drmrs, please ignore... [09:44:24] Ack [09:44:34] PROBLEM - LVS text-https drmrs port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.drmrs.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:44:34] * volans here [09:44:42] XioNoX: ack, need a hand? [09:45:17] I'm on my phone, can someone please downtime it all? [09:45:35] XioNoX: ack, doing [09:45:35] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:46:08] I guess the telxius transport link went down? [09:46:32] can we also disable all the paging alerts for drmrs? 
[09:46:40] I can have a look [09:47:02] XioNoX: asw1-b13-drmrs ping failed fwiw [09:47:08] RECOVERY - LVS text drmrs port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv6 #page on text-lb.drmrs.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 623 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:47:24] RECOVERY - LVS upload drmrs port 80/tcp - Images and other media- upload.eqiad.wikimedia.org IPv6 #page on upload-lb.drmrs.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 491 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:47:37] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_drmrs02_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:51:28] RECOVERY - LVS text-https drmrs port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.drmrs.wikimedia.org_ipv6 is OK: OK - Certificate *.wikipedia.org will expire on Thu 17 Nov 2022 11:59:59 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:51:32] PROBLEM - Maps edge drmrs on upload-lb.drmrs.wikimedia.org is CRITICAL: /osm-intl/info.json (tile service info for osm-intl) timed out before a response was received: /private-info/info.json (private tile service info for osm-intl) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m+ffffff@2x.png (Untitled test) timed out before a response was received: /_info (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps/RunBook [10:02:26] volans: thanks! [10:02:56] XioNoX: I think I've downtimed all the alerting ones until monday 1h from now [10:03:24] I'm checking puppet if there is a quick way to disable paging just for drmrs, but are you sure that there isn't anything critical there yet? 
[10:03:35] yep [10:32:53] (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85% - https://alerts.wikimedia.org [10:56:21] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:03:17] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_drmrs02_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:05:26] downtimed this one too as it's flapping, until monday morning [11:26:15] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:30] (PS2) Ladsgroup: Enable wmgEmergencyCaptcha everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/763870 [11:49:03] (CR) Ladsgroup: "PS2 is manual rebase" [mediawiki-config] - https://gerrit.wikimedia.org/r/763870 (owner: Ladsgroup) [11:52:55] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 109 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:05:05] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 109 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:07:27] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 34 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:17:11] PROBLEM - MediaWiki 
exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 110 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:20:17] ^Is this known? [12:20:27] yes [12:20:38] kind of [12:22:10] I couldn't find anything, so I filed https://phabricator.wikimedia.org/T302152 [12:22:20] Oh wait [12:22:30] I didn't read the error message at first [12:22:41] It says "Parsoid", here we go with the classic opcache corruption? [12:23:05] <_joe_> oh damn, yes [12:23:27] :-D oh my [12:23:34] yep [12:24:39] Sigh. [12:24:59] <_joe_> !log restarted php-fpm on wtp1027 [12:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:49] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:30:46] <_joe_> sigh my bad [12:30:57] <_joe_> I didn't look at the exception as I was concentrated on other stuff [12:44:49] PROBLEM - Puppet CA expired certs on puppetmaster1001 is CRITICAL: CRITICAL: 1 puppet certs need to be renewed: https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate [12:47:03] wut? 
[12:47:04] checking [12:52:32] it's for etcd.codfw.wmnet, it expires on the 26th, opening a task, can wait monday [12:54:56] SRE, serviceops: Renew puppet cert for etcd.codfw.wmnet - https://phabricator.wikimedia.org/T302153 (Volans) p:Triage→Medium [12:55:33] not acking the alert to allow it to fire again in case there will be other certs expiring [13:43:00] (JobUnavailable) firing: Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [14:02:10] SRE, serviceops, Wikimedia-production-error: PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (Daimona) [14:32:53] (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85% - https://alerts.wikimedia.org [14:37:57] (PS6) Giuseppe Lavagetto: conftool: add request-actions / request-patterns [puppet] - https://gerrit.wikimedia.org/r/763486 [14:37:59] (PS2) Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. [puppet] - https://gerrit.wikimedia.org/r/763557 [14:41:01] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:05:25] (PS3) Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. [puppet] - https://gerrit.wikimedia.org/r/763557 [16:29:13] (PS1) Andrew Bogott: wmcs-cinder-backup-manager: add more nfs volume backups [puppet] - https://gerrit.wikimedia.org/r/763909 (https://phabricator.wikimedia.org/T301280) [16:31:11] (PS4) Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. 
[puppet] - https://gerrit.wikimedia.org/r/763557 [16:36:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:37:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:58] !log ladsgroup@deploy1002 Synchronized private/PrivateSettings.php: (no justification provided) (duration: 00m 48s) [16:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:13] (CR) Andrew Bogott: [C: +2] wmcs-cinder-backup-manager: add more nfs volume backups [puppet] - https://gerrit.wikimedia.org/r/763909 (https://phabricator.wikimedia.org/T301280) (owner: Andrew Bogott) [16:40:53] !log ladsgroup@deploy1002 Synchronized private/PrivateSettings.php: (no justification provided) (duration: 00m 48s) [16:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:49:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[16:50:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:15] PROBLEM - Recursive DNS on 2a02:ec80:600:2:185:15:58:37 is CRITICAL: DNS_QUERY CRITICALError response or zero answers: https://wikitech.wikimedia.org/wiki/DNS [17:41:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [17:43:00] (JobUnavailable) firing: Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [17:43:16] that's drmrs [17:43:32] Someone might want to downtime with the rest of it until Monday [17:46:55] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [18:04:51] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:32:53] (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85% - https://alerts.wikimedia.org [19:10:42] SRE, SRE-Access-Requests, SRE-OnFire, WMF-Legal: Grant Zabe access to the T302047 incident report - https://phabricator.wikimedia.org/T302163 (Zabe) As the #WMF-Legal project tag was added to this task, some general information to avoid wrong expectations: Please note that public tasks in Wikimed... [19:20:23] !bash Problem: Be DDoSd. Solution: Set up notification system in case of DDoS. Problem: Be DDoSd. 
[19:20:23] Amir1: Stored quip at https://bash.toolforge.org/quip/BtJtE38B1jz_IcWulsRV [19:25:41] SRE, SRE-Access-Requests, SRE-OnFire, WMF-Legal: Grant Zabe access to the T302047 incident report - https://phabricator.wikimedia.org/T302163 (Zabe) [19:27:01] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:30:17] I feel personally attacked by that, perryprog ;) [19:30:53] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01028 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [19:31:03] :) [19:33:35] RECOVERY - Host logstash2028.mgmt is UP: PING WARNING - Packet loss = 90%, RTA = 35.59 ms [20:06:33] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:30:35] SRE, SRE-OnFire: Grant Zabe access to the T302047 incident report - https://phabricator.wikimedia.org/T302163 (Aklapper) Open→Resolved This has happened by subscribing [20:31:12] SRE, SRE-OnFire: Grant Zabe access to the T302047 incident report - https://phabricator.wikimedia.org/T302163 (RhinosF1) Resolved→Open Zabe means the actual doc not the task [20:32:31] SRE, SRE-Access-Requests, SRE-OnFire, WMF-Legal: Grant Zabe access to the T302047 incident report - https://phabricator.wikimedia.org/T302163 (RhinosF1) [20:50:12] SRE, SRE-Access-Requests, SRE-OnFire, WMF-Legal: Grant Zabe access to the T302047 incident report - https://phabricator.wikimedia.org/T302163 (Aklapper) Ah, sorry! 
[21:43:00] (JobUnavailable) firing: Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [22:30:55] SRE, Traffic-Icebox, WMF-General-or-Unknown, User-DannyS712, affects-Kiwix-and-openZIM: Pages whose title ends with semicolon (;) are intermittently inaccessible (likely due to ATS) - https://phabricator.wikimedia.org/T238285 (matmarex) [22:32:53] (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85% - https://alerts.wikimedia.org [22:58:34] SRE, Traffic-Icebox, WMF-General-or-Unknown, User-DannyS712, affects-Kiwix-and-openZIM: Pages whose title ends with semicolon (;) are intermittently inaccessible (likely due to ATS) - https://phabricator.wikimedia.org/T238285 (AlexisJazz) https://en.wikipedia.org/wiki/%3B_ doesn't work either...