[00:00:15] RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:00:59] I skimmed https://www.mediawiki.org/wiki/MediaWiki_1.38/wmf.17, seems fine [00:01:00] legoktm: thanks for the update. i see twentyafterfour stepped in as well. <3 [00:01:55] I feel comfortable that the revert is likely fine, but I haven't been paying attention for the past two weeks, so I'm more at 75% confidence [00:02:32] That's probably sufficient. [00:02:48] alright. let's deploy the reverts. if anything else goes wrong or it's not fixed, let's roll the train back and wait until tuesday [00:02:54] WFM. [00:02:57] i'm around as well. i am... not super plugged into what's been going on, but i can watch logs for a bit. [00:03:00] dduvall: Are you doing or should I? [00:03:06] James_F: go for it :) [00:03:10] Whee. [00:03:20] (03CR) 10Jforrester: [C: 03+2] "Emergency deploying." [core] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754046 (https://phabricator.wikimedia.org/T299244) (owner: 10Legoktm) [00:04:18] (There goes my personal 'helping other teams' budget for the week, oh well.) [00:04:20] sounds good, while jenkins runs I'll grab some snacks [00:05:05] and I'll take it from you legoktm :P [00:05:44] This is a pretty broad commit set. Is scapping it going to actually work, or will it choke on the canaries? [00:06:31] If not, should we first rollback the train from group2, then deploy it, then re-roll? [00:06:34] (Gah.) [00:06:55] I think if you sync includes/deferred/LinksUpdate.php first, and then sync all it'll cut down on most errors [00:06:56] James_F: it changes namespacing of class AFAICS. It will very likely throw a lot on canaries [00:07:19] oh, bleh [00:07:23] legoktm: Won't it clash in the autoloader? [00:07:24] Yeah. [00:07:40] dduvall: Rollback of the train is now a single command, right? 
[00:07:40] well, at least it won't be trying to load a deleted file? :) [00:07:41] (03CR) 10Krinkle: [C: 03+1] Revert "LinksUpdate refactor" and follow-ups [core] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754046 (https://phabricator.wikimedia.org/T299244) (owner: 10Legoktm) [00:08:08] personally I'd rather we swallow the exceptions rather than mess with moving the train back and forth [00:08:24] most of this runs during the job queue so it won't be as user facing and any failures will be retried [00:08:25] A force-deploy scap of mediawiki/includes? [00:08:44] But the job-injection code will also throw errors. [00:09:09] it's not a whole lot of work to rollback the train. two commands [00:09:13] if you want me to do that [00:09:14] ack. [00:09:18] legoktm, no need to unnecessary shortcuts. [00:09:21] what dduvall said. [00:09:27] dduvall: Could you? [00:09:30] job inject is in deferred updates afaik [00:09:33] sure thing [00:10:08] James_F: ugh, right. [00:10:20] legoktm: Isn't everything terrible? [00:10:21] Yeah, I guess moving back will lessen the impact [00:10:29] All the way to group0? [00:10:30] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/754047/ was just pushed to gerrit, claiming to fix the bug [00:10:44] was just going to ask. rollback group1 as well? [00:11:07] dduvall: … yeah, let's. [00:11:28] legoktm, Krinkle see taavi's message there ^ [00:11:31] OTOH, if we're temporarily rolling back anyway, do we also want to deploy this and re-roll, or should we just leave it? [00:12:00] taavi: I suspect Umherirrender's analysis and patch is correct, but I'm not super comfortable leaving the LinksUpdate code around even with that given the amount of regressions we've seen so far [00:12:00] Umherirrender's patch makes sense to me but I can't test locally. Can someone? [00:12:01] I'd rather give Tim time to look over the proposed fix and some of the other aspects in the task [00:12:09] +1 [00:12:10] ^^ [00:12:24] I'm for rollback just the refactor. 
It's a drop-in replacement and forward/back compat, should be fine [00:12:49] legoktm: makes sense [00:13:30] as for atomic deploy, I think it's unfortunate that we still haven't made scap use git fetch/checkout, nor have we enabled fpm graceful restarts with revalidation disabled, which indeed means deploying anything non-trivial is always ungraceful even without --force. [00:14:11] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: Revert "all/group1 wikis to 1.38.0-wmf.17" [00:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:23] yeah, sigh. `scap deploy` uses git but sadly we never migrated mediawiki to use it [00:15:02] Obviously mw-on-k8s will make things much better, but with a few months' work we could have got from bad to good before we get to great. Oh well. [00:15:11] yep [00:15:31] perfect is often the enemy of good [00:15:40] or better [00:15:43] There's no logstash key for the cl_from=0 right? It's just silently failing to update anything meaningful. [00:16:12] (Perhaps the DB layer should throw when the page ID pointer is a nonsense value, but that's another thing for the post-incident review.) [00:16:37] k. James_F, legoktm: group1/group2 are on wmf.16 now [00:16:40] Ack. [00:16:42] Thanks dduvall [00:17:01] * James_F glares at zuul. [00:17:02] I see there have been no config patches since the wmf.17 promotion. If that hadn't been the case, I'd say it's more risky to use the train as a clever trick to avoid errors. I'd say --force would have been normal, not a shortcut, given that "normal" means something that always works, and using the train doesn't always work, but in this case it seems like the best possible option indeed :) [00:17:22] I'm offline for the night, thanks everyone [00:17:28] thanks for all your help taavi [00:17:37] gn8 taavi [00:17:39] o/ [00:17:50] are we anticipating moving back to group1/group2 imminently?
if so, i'll leave the reverts in /srv/mediawiki-staging and discard when it's time to move back [00:18:04] dduvall: Yes. [00:18:07] k [00:18:10] I think as soon as James_F syncs it we should move forward again [00:18:15] ack [00:18:16] Well. [00:18:23] First we might want to check that things work on group0. ;-) [00:18:25] with some breathing room :) [00:18:29] * James_F grins. [00:22:41] OK, we're in the final furlong of the CI run. [00:23:53] (03Merged) 10jenkins-bot: Revert "LinksUpdate refactor" and follow-ups [core] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754046 (https://phabricator.wikimedia.org/T299244) (owner: 10Legoktm) [00:26:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [00:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:27:01] (03PS1) 10Catrope: doc.wikimedia.org CSP: Allow XHR requests to Wikipedia and Wikidata [puppet] - 10https://gerrit.wikimedia.org/r/754048 (https://phabricator.wikimedia.org/T285570) [00:27:09] Krinkle, noted reg. config & shortcuts. thanks for the explanation! [00:27:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [00:27:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [00:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:27:33] (03CR) 10Catrope: [C: 04-1] "Please do not merge without review and approval from the Security team" [puppet] - 10https://gerrit.wikimedia.org/r/754048 (https://phabricator.wikimedia.org/T285570) (owner: 10Catrope) [00:27:52] I suppose we can test using mw-on-k8s now :) [00:28:30] OK, this is now live on mwdebug1002. [00:28:35] Anyone else is welcome to test too. 
[00:28:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [00:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:28:56] (But only on group0.) [00:29:29] Of course the jobs won't go through the debug server, but at least the triggering of them can be tested. [00:30:36] on test.wp, I created [[What even]] with contents [[Category:What]], saw it appear, then deleted the page and saw it disappear from the category [00:31:24] Hmmmmm. [00:31:46] I just deleted a page there and on deletion I got the message "The page or file ‘Why’ could not be deleted. It may have already been deleted by someone else." with a deletion log of me deleting it. [00:31:55] Maybe I accidentally double-clicked delete? [00:32:02] links update, and deletedlinksupdate run in deferred post-send as first attempt [00:32:06] become jobs if they fail. [00:32:13] (+ cascading updates > jobs) [00:32:15] Krinkle: Oh, right, that explains them working on testwiki. [00:32:16] Ack. [00:32:31] and {{PAGESINCATEGORY:...}} has the correct value [00:33:20] Not right now it doesn't. [00:33:34] Pages in category: 0 | Pages in category ‘What’ This category contains only the following page. W Why not [00:33:48] Eurgh. [00:33:59] MW-SNAFU of category counting? [00:34:31] hrm [00:35:08] no, I screwed up the syntax [00:35:25] it's "PAGESINCATEGORY:What" not "...Category:What" [00:35:40] Ha. [00:36:04] Yeah, I got 'The page or file ‘Why not’ could not be deleted. It may have already been deleted by someone else.' again. [00:36:06] But it deleted. [00:36:19] And the category count and membership updated correctly. [00:36:29] OK, let's sync this. Agreed, legoktm, Krinkle? [00:36:33] +1 [00:36:37] Ack. [00:37:02] Should I just do a real sync-world? 
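[Editor's note] Krinkle's explanation above at [00:32:02]–[00:32:13] — links updates run post-send as deferred updates on the first attempt and only become jobs (which get retried) if that attempt fails — can be sketched roughly as follows. This is a hypothetical Python illustration of the pattern, not MediaWiki's actual DeferredUpdates/JobQueue code; the names `run_post_send` and `links_update` are invented for the sketch:

```python
from collections import deque

# Stand-in for the job queue; failed deferred updates land here for retry.
job_queue = deque()
applied = []

def run_post_send(update, *args):
    """First attempt: run the update post-send; on failure, enqueue as a job."""
    try:
        update(*args)
    except Exception:
        job_queue.append((update, args))  # a job runner would retry this later

def links_update(page_id):
    if page_id == 0:
        raise ValueError("bogus page id")  # cf. the cl_from=0 rows discussed above
    applied.append(page_id)

run_post_send(links_update, 42)  # succeeds on the first, post-send attempt
run_post_send(links_update, 0)   # fails, falls back to the job queue
print(applied)          # [42]
print(len(job_queue))   # 1
```

This is also why the breakage was partly visible on testwiki despite the job runners not going through the debug server: the first attempt happens in the web request's post-send phase.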
[00:37:21] (03PS1) 10Dduvall: Revert "all wikis to 1.38.0-wmf.17 refs T293958" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754049 [00:37:23] (03CR) 10Dduvall: [C: 03+2] Revert "all wikis to 1.38.0-wmf.17 refs T293958" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754049 (owner: 10Dduvall) [00:37:25] (03PS1) 10Dduvall: Revert "group1 wikis to 1.38.0-wmf.17 refs T293958" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754050 [00:37:27] (03CR) 10Dduvall: [C: 03+2] Revert "group1 wikis to 1.38.0-wmf.17 refs T293958" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754050 (owner: 10Dduvall) [00:37:29] don't mind ^ [00:37:29] dduvall: Wait, not yet. [00:37:38] just the reverts [00:37:40] Oh, right, that's just clean-up, never mind. :-) [00:37:41] Yeah. [00:38:01] i realized that if we're going back just to group1 i'd have to fiddle with the git HEAD and i decided not to do that [00:38:09] (03Merged) 10jenkins-bot: Revert "all wikis to 1.38.0-wmf.17 refs T293958" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754049 (owner: 10Dduvall) [00:38:13] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.38.0-wmf.17 refs T293958" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754050 (owner: 10Dduvall) [00:38:22] * James_F nods. [00:38:50] dduvall: Do you think a scap sync-world is better than syncing includes/ and then the autoloader and then other things and hoping the order is roughly right? [00:39:06] (My kingdom for atomic deploys.) [00:39:51] without knowing the details here, i can't really say [00:40:06] * James_F nods. [00:40:08] my guess is that either is ok since we're still on group0 [00:40:13] True. [00:40:17] Let's do the safe thing then. [00:40:35] I'd personally try sync-file and if it fails do sync-world [00:41:06] I'd be nervous of that getting into a messy situation. [00:41:18] I guess we could sync-file --force ? [00:41:23] But I'd really rather avoid. 
[00:41:59] sync-world shouldn't be _that_ slow with everything already out [00:41:59] I don't think it would be that messy, but up to you [00:42:39] !log jforrester@deploy1002 Started scap: Revert "LinksUpdate refactor" and follow-ups for T299244 re. T293958 [00:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:42:44] T299244: {{PAGESINCATEGORY:Wikipedia:Nuweg}} not decreased when page is deleted - https://phabricator.wikimedia.org/T299244 [00:42:44] T293958: 1.38.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T293958 [00:42:45] Going with a full scap. [00:43:56] btw, what makes you suggest --force? [00:44:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [00:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:42] dancy: because sync isn't atomic and there's class name changes and autoloader changes etc. [00:44:46] dancy: If the autoloader is referring to a file and then we sync the includes/ directory such that some requests will hit a file that no longer exists, I've run into the state of the initial sync passing the canaries but subsequent ones always failing as jobs get retried. [00:44:59] dancy: Aka this is why I would drink if I did. ;-) [00:45:15] so aspect of --force are you looking for? Unconditional php-fpm restart? [00:45:19] *What aspect [00:45:22] * Krinkle looks at a bottle of "stroomwafel liqour" on his kitchen counter [00:45:32] dancy: bypassing canary failure i believe [00:45:34] The "ignore what the canaries say, I really need this change synced to the machines". [00:45:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [00:45:49] Ah, I didn't realize the canaries were problematic. 
[00:45:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [00:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:45:57] (and no James_F , I didn't buy it. I rarely drink.) [00:45:58] OK, the scap sync-world has done everything except the cdb-rebuild. [00:45:59] Good times. [00:46:37] !log jforrester@deploy1002 Finished scap: Revert "LinksUpdate refactor" and follow-ups for T299244 re. T293958 (duration: 03m 58s) [00:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:46:44] And we're done. Fastest serious sync-world in history? [00:46:51] OK, time to test again. [00:46:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [00:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:47:04] wow, fast indeed. [00:47:13] Krinkle: omg, it's actually a thing: https://www.totalwine.com/spirits/liqueurscordialsschnapps/chocolate-sweets-candy/caramel/van-meers-stroopwafel-liqueur/ [00:47:18] And much less risk of James having a heart attack. [00:47:19] yeah, it's really not that bad. we've talked about making everyone use it all the time :) [00:47:21] I have spent some time shaving seconds of sync-world time over the last few montsh [00:47:24] *months [00:47:27] ori, Krinkle: Good grief. [00:47:30] ori: that's the one [00:47:33] RECOVERY - SSH on restbase2011.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:47:38] dancy: You sir are a gent amongst humans. [00:47:41] that's right. thanks to dancy [00:48:07] Krinkle: that sounds a little too delicious [00:48:19] like, i'd need to clear my calendar [00:48:20] of course, soon we'll turn php-fpm restart on and all syncs will take 3 minutes. 
Enjoy it while it lasts. [00:48:35] :-) [00:48:37] I restored and re-deleted [[What even]] and the category page updated properly, showing it and then it disappearing [00:48:48] \o/ [00:48:49] Ack, LGTM. [00:48:55] OK, dduvall, want to re-roll the train? [00:48:59] yep yep [00:49:03] Whee. [00:49:03] group1 here we go [00:49:05] whew! [00:49:14] Nice work everyone [00:49:23] I could only watch but it was an adventure [00:49:23] (03PS1) 10Dduvall: group1 wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754051 [00:49:26] (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754051 (owner: 10Dduvall) [00:49:44] but also, with fpm restarts, we'll have practically atomic deploys (short of a previous deploy never having compiled one of the files changing in the next deploy and first reading it off disk after the sync before the restart) [00:49:57] It's lovely how the wind ominously whistles around the eaves of my building, as the temperature falls towards -10C, in the darkest of nights, but this feels like a success anyway. [00:50:05] (03Merged) 10jenkins-bot: group1 wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754051 (owner: 10Dduvall) [00:50:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:51:19] ori: not even a "but when I drink, it's liquid stroopwafels". It'd make for a fun story, but no, no liquid stroopwafels, liquid cheese or other liquid forms of dutch food ite
actually, no I'd have liquid licorice and in fact have had that with alcohol, it's pretty good. [00:51:39] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.17 refs T293958 [00:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:51:42] T293958: 1.38.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T293958 [00:51:46] James_F: from this, i gather you made it safely to new york. :) [00:52:06] brennen: Oh, yes, hello from NYC to you too. [00:52:16] (03CR) 10Eevans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/753942 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [00:52:32] !log dduvall@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.17 refs T293958 (duration: 00m 52s) [00:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:53:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:53:24] https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/refs/heads/master/maintenance/recountCategories.php [00:53:46] legoktm: You think we should run it on enwiki's CSD cats? [00:54:21] just everywhere I think [00:54:31] all looks well after group1 promotion. do y'all want to verify on a group1 wiki as well or should i take us to all wikis? [00:55:01] I don't think I have admin privs on any group1 wiki, so I can't help with that. Don't really think it's required though [00:55:13] all looks well === no troubling errors [00:55:26] alright. 
all wikis here we go [00:55:39] (03PS1) 10Dduvall: all wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754053 [00:55:41] (03CR) 10Dduvall: [C: 03+2] all wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754053 (owner: 10Dduvall) [00:55:42] James_F: probably hack it to run on the categories we care about (CSD, etc.) first and then let it run in the background everywhere [00:56:24] (03Merged) 10jenkins-bot: all wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754053 (owner: 10Dduvall) [00:56:49] Yeah… [00:57:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [00:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:57:59] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.17 refs T293958 [00:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:58:02] T293958: 1.38.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T293958 [00:58:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [00:58:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [00:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:59:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [00:59:24] legoktm, do you need me to test something on commons? [00:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:59:42] AntiComposite: We're live everywhere now, so no, enwiki should be "working" again (but the counts will be wrong). 
[01:00:30] (03CR) 10Ottomata: kafka: add check to test the Broker's TLS port (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/753738 (owner: 10Elukey) [01:01:00] (03CR) 10Ottomata: [C: 03+2] Absent network_flows_internal druid jobs [puppet] - 10https://gerrit.wikimedia.org/r/753818 (https://phabricator.wikimedia.org/T263277) (owner: 10Joal) [01:01:12] hm, the recount script seems pretty fast [01:01:54] https://phabricator.wikimedia.org/P18738 [01:01:59] legoktm: Does it work though? [01:02:14] I have no idea :< [01:02:21] Meh. [01:03:26] "The script runs reasonably quickly on all but the very largest wikis." [01:03:30] thanks tto :D [01:03:42] And yet 'the very largest wikis' are where this is most needed. [01:04:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [01:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:04:51] James_F, legoktm: thanks a ton for fixing things. do you need me to stick around? [01:04:53] !log starting recountCategories.php --mode pages --wiki enwiki on mwmaint1002 [01:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:05] dduvall: I think we're set, thank you :)) [01:05:26] dduvall: Thanks! [01:05:42] right on :) i will go join the family for dinner then. break a leg! [01:05:52] dduvall: See you around. [01:06:28] max(cat_id) on enwp is 248,681,592 [01:06:44] it's already at 222M [01:07:20] is it actually doing anything legoktm ? Sounds too fast to be true to me :) [01:08:03] If something seems too good to be true it probably is, indeed. 
[01:08:53] it finished [01:08:58] counts on https://en.wikipedia.org/wiki/Category:Candidates_for_speedy_deletion are still off [01:09:43] running with --mode subcats and then --mode files just in case those are the remaining issues [01:10:45] nope [01:10:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [01:10:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [01:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:12:07] legoktm: I guess we'll need to run the real script instead? [01:12:21] I think I got it, the categorylinks table has bogus entries still [01:12:30] so until that's cleared, the recount script is just...counting them still [01:13:10] Anything with cl_from=0 is definitely wrong and can just be DELETE FROM'ed, right? [01:13:54] kind of [01:14:09] https://phabricator.wikimedia.org/P18739 [01:14:35] Ah, right, page_id is the issue of course. [01:14:37] Hmm. [01:14:52] so something like DELETE FROM categorylinks LEFT JOIN page ON cl_from=page_id WHERE page_id IS NULL; [01:16:00] no one else complained about other links tables right, just categorylinks? [01:16:24] Umher's comment suggested other tables might be wrong too. [01:17:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [01:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:17:06] ah right [01:17:09] ok [01:18:15] testwp has no page_id is NULL so I'm testing my deletion query on mw.o [01:19:50] https://www.mediawiki.org/wiki/User:Legoktm/sandbox2 shows the wrong counts [01:19:51] If page_id is NULL and you're joining on cl_from=page_id then won't it just be cl_from=null? [01:20:00] it's a left join [01:20:07] Oh duh. [01:20:17] Clearly it's too late for me to be doing this. 
;_) [01:20:21] so I'm going to try, on mw.o [01:20:40] categoryUpdate.php? [01:21:02] DELETE FROM categorylinks LEFT JOIN page ON cl_from=page_id WHERE page_id IS NULL AND cl_to="Candidates_for_deletion" LIMIT 3; [01:21:33] the cl_to= and LIMIT clauses to limit damage if I messed up [01:21:36] That should work. [01:21:38] Yes. [01:21:55] Given we can find them again via WLH on the templates it's not terminal. [01:22:03] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=PUT https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [01:22:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [01:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:22:18] syntax error [01:23:02] Don't you have to do the join as an inner select and then DELETE FROM WHERE in that? [01:23:03] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [01:23:26] https://stackoverflow.com/questions/2763206/deleting-rows-with-mysql-left-join [01:23:34] but now it doesn't like my LIMIT [01:23:40] wikiadmin@10.64.0.44(mediawikiwiki)> DELETE categorylinks FROM categorylinks LEFT JOIN page ON cl_from=page_id WHERE page_id IS NULL AND cl_to="Candidates_for_deletion" LIMIT 3; [01:23:40] ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'LIMIT 3' at line 1 [01:24:42] Do a SELECT FROM first? [01:25:18] hm [01:25:46] > The LIMIT clause places a limit on the number of rows that can be deleted. These clauses apply to single-table deletes, but not multi-table deletes. 
[01:25:49] from https://dev.mysql.com/doc/refman/5.6/en/delete.html [01:25:54] Helpful. [01:26:17] yeah let me rewrite it into a subquery [01:27:50] DELETE FROM categorylinks WHERE cl_from IN (SELECT cl_from FROM categorylinks LEFT JOIN page ON cl_from=page_id WHERE page_id IS NULL AND cl_to="Candidates_for_deletion" LIMIT 3); [01:28:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [01:28:30] The inner SELECT returns nothing for me on enwiki. [01:28:35] mediawikiwiki [01:28:50] Ack, that works. [01:28:53] it should return 3 rows there [01:28:57] It does. [01:29:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [01:29:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [01:29:17] ffs [01:29:39] "This version of MariaDB doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'" [01:30:21] Well that's just unhelpful. [01:30:35] It's only MW.org; if it breaks the world it doesn't matter. [01:30:51] well it means this approach is just broken [01:31:01] because we need the LIMIT otherwise it won't be replag safe on enwp [01:31:13] might as well just write a proper maint script that selects and then deletes [01:31:26] Yeah. [01:32:12] give me a bit of time [01:33:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [01:35:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [01:40:28] James_F: https://phabricator.wikimedia.org/P18740 [01:41:48] sorry, gotta take care of something IRL, brb in ~20 [01:41:49] Should the delete be in the same batch or its own loop? Should we add a --dry-run option to just print out what'll get deleted? [01:41:59] Otherwise LGTM. 
[01:47:06] I don't think this works [01:47:14] $id isn't set anywhere [01:47:50] you forgot a foreach [02:12:00] uh [02:12:06] I think it was supposed to be $toDelete [02:13:07] I'll add a --dry-run [02:14:44] updated https://phabricator.wikimedia.org/P18740 James_F, ori [02:15:51] one more update, dropped beginTransaction(), will let that happen implicitly [02:16:05] I'm not sure this works either. Won't it loop forever if there are more than batch size rows? [02:16:19] the select will keep selecting the same rows [02:16:47] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:16:50] I like James_F's idea of separating the two [02:17:21] loop forever in dry run mode that is [02:17:47] err, right [02:18:01] that means we have to hold all the IDs in memory, right? [02:18:17] I can just add a tracker for cl_to and add a cl_to >= $last condition [02:18:28] how many are there? [02:19:36] checking... [02:19:55] anyways [02:19:58] er [02:20:05] RAM won't be an issue unless there are billions [02:20:13] fair [02:21:09] I have a question, local site notices are not seen on mobile on bnwiki. What is the reason for this? [02:21:11] is there a point in batching the select then? [02:23:07] my plain select against enwp is still running [02:23:19] yeah ok, batching makes sense then [02:23:34] MdsShakil: I'm not sure, but you might have better luck in #wikimedia-tech or asking on https://meta.wikimedia.org/wiki/Tech [02:24:14] in that case maybe have the dry run mode bail after one iteration [02:24:20] given that this is a one-off script [02:24:32] and just verify that the IDs it's outputting look sane [02:24:35] Also that. 
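[Editor's note] The select-then-delete approach under review here (the P18740 script itself is not quoted in the log) can be sketched as below — a minimal, hypothetical Python/sqlite3 version with made-up table contents, not the actual maintenance script. Materializing the batch of IDs client-side sidesteps MariaDB's "LIMIT & IN/ALL/ANY/SOME subquery" restriction, and dry-run mode bails after one batch so it can't loop forever, as agreed above:

```python
import sqlite3

def delete_orphaned_categorylinks(conn, batch_size=3, dry_run=False):
    """Delete categorylinks rows whose cl_from points at no existing page."""
    total = 0
    while True:
        # Materialize the next batch of orphaned IDs client-side, avoiding
        # the unsupported LIMIT-inside-IN-subquery form.
        ids = [r[0] for r in conn.execute(
            "SELECT cl_from FROM categorylinks "
            "LEFT JOIN page ON cl_from = page_id "
            "WHERE page_id IS NULL LIMIT ?", (batch_size,))]
        if not ids:
            break
        if dry_run:
            print("would delete:", ids)
            return total  # bail after one batch; just eyeball the IDs
        marks = ",".join("?" * len(ids))
        conn.execute(
            f"DELETE FROM categorylinks WHERE cl_from IN ({marks})", ids)
        conn.commit()  # production would wait for replication between batches
        total += len(ids)
    return total

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT)")
conn.executemany("INSERT INTO page VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO categorylinks VALUES (?, ?)", [
    (1, "Kept"), (2, "Kept"),                      # live pages: keep
    (7, "Candidates_for_deletion"), (8, "What"),   # orphaned: delete
    (9, "Candidates_for_deletion"),                # orphaned: delete
])
deleted = delete_orphaned_categorylinks(conn, batch_size=2)
print(deleted)  # 3
```

Deleting by plain `cl_from IN (...)` is safe here precisely because the page no longer exists, so every categorylinks row for that ID is garbage.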
[02:24:54] ok [02:27:25] https://phabricator.wikimedia.org/P18741 [02:28:39] checking with ?curid= shows that all those page ids don't exist anymore [02:28:46] (really weird error message too, but that's another issue) [02:29:41] also my select even with LIMIT 500 on enwp is still going [02:29:59] it's scanning uh, 168M rows [02:30:14] | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | [02:30:14] +------+-------------+---------------+--------+---------------+--------------+---------+------------------------------+-----------+--------------------------------------+ [02:30:14] | 1 | SIMPLE | categorylinks | index | NULL | cl_timestamp | 261 | NULL | 168681299 | Using index | [02:30:14] | 1 | SIMPLE | page | eq_ref | PRIMARY | PRIMARY | 4 | enwiki.categorylinks.cl_from | 1 | Using where; Using index; Not exists | [02:31:02] I think the where cl_from >= will speed it up... [02:31:16] down to 87M rows [02:31:30] (via explain) [02:32:29] RECOVERY - ElasticSearch unassigned shard check - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [02:33:57] I'm not sure there's any faster way to do it [02:34:05] Ah well. [02:34:05] the commons categorylinks table is even bigger [02:34:45] Does Commons need the fix? [02:34:50] Edit rate is very much lower. [02:35:37] I wonder if we go in reverse order it'll be faster [02:35:55] because deleted pages are more likely to be recently created [02:36:01] yeah, c:CAT:CSD has miscounts [02:36:40] (03PS1) 10Scardenasmolinar: Change TheWikipediaLibrary editcount [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754054 (https://phabricator.wikimedia.org/T288070) [02:37:36] ok, going backwards returns results faster [02:38:13] my LIMIT 500 query on Commons returned 500 results [02:38:37] oops [02:38:51] 18:29:40 also my select even with LIMIT 500 on enwp is still going <-- this was wrong, ignore [02:39:08] Hmm. 
[02:39:26] it took ~1m to return the first 500 results [02:39:33] is scanning 84m rows a lot? [02:39:41] I think it's going to perform ok [02:39:43] no, I screwed up [02:39:44] yeah it's fine [02:39:59] I did SELECT COUNT(*) ... LIMIT 500, which is useless [02:40:22] with the cl_from >= it's fine [02:40:56] :-) [02:41:00] so I think we clean up categorylinks first and then extend the script to all the other links tables [02:41:09] WFM. [02:41:40] well, clean up categorylinks, re-run recountCategories, then the other links tables [02:41:53] * James_F nods. [02:42:10] ok, please review https://phabricator.wikimedia.org/P18740 [02:43:28] ori, James_F ^ [02:44:02] Just wait for replication, no additional sleep? [02:44:16] It should be fine. [02:44:33] yeah I think it should be fine [02:44:41] +2 [02:45:07] LGTM but do you want a DBA on hand in case we reasoned badly? [02:46:09] we're probably 3-4 hours away from a DBA being awake I think [02:46:17] Or Monday. [02:46:30] are you worried about the queries/deletes being expensive or deleting the wrong thing? [02:46:37] "yes" [02:46:39] Well, Amir.1 will be around in a few hours probably. [02:47:09] your script looks correct and I think it's probably safe, but it's the production database [02:47:19] It's secondary data, ultimately. [02:47:33] If we dropped the entire table it'd be bad but recoverable without backups. [02:47:43] (Though editors would be Unhappy™.) [02:47:56] yeah, what you want to have on speed dial is a backup guy, not a dba [02:48:02] !!! [02:48:11] perfect timing jynus :) [02:48:11] jynus: Oops, did we accidentally summon you? [02:48:13] hahaha [02:48:18] I can imagine you just woke up in your sleep [02:48:21] (I am not here, BTW) [02:48:22] "something is not right" [02:48:28] Someone somewhere is doing something bad. [02:48:33] I can feel it in my fingers. 
[02:49:24] legoktm: Umherirrender just suggested refreshLinks.php --dfn-only [02:49:38] we have backups of the *link tables too, so don't worry [02:49:52] amazing, of course it already existed [02:49:57] $this->addOption( 'dfn-only', 'Delete links from nonexistent articles only' ); [02:50:01] Yup. [02:50:10] * ori headdesks [02:50:11] Apparently it's already a cron'ed job? [02:50:40] https://gerrit.wikimedia.org/g/operations/puppet/+/1415150baa54865bc30173de750c1a9f71ca8626/modules/profile/manifests/mediawiki/maintenance/refreshlinks/periodic_job.pp#6 [02:50:42] https://gerrit.wikimedia.org/g/operations/puppet/+/1415150baa54865bc30173de750c1a9f71ca8626/modules/profile/manifests/mediawiki/maintenance/refreshlinks/periodic_job.pp [02:50:44] Snap. [02:50:49] ok, this is totally safe to run then [02:50:58] Yeah, let's JFDI. [02:51:17] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:51:45] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:52:08] !log started mwscript refreshLinks.php --wiki=enwiki --dfn-only [02:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:52:56] I'll do a separate one for commons too [02:52:59] legoktm: You should mention the task in your !logs so that Phab stalkers get some info. [02:53:08] and then the rest can be foreachwikiindblist [02:53:20] Yeah. 
[02:54:14] !log started mwscript refreshLinks.php --wiki=enwiki --dfn-only (T299244) [02:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:54:17] T299244: {{PAGESINCATEGORY:Wikipedia:Nuweg}} not decreased when page is deleted - https://phabricator.wikimedia.org/T299244 [02:54:44] Ta [02:55:10] !log started mwscript refreshLinks.php --wiki=commonswiki --dfn-only (T299244) [02:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:56:56] for the rest I'm just going to start the systemd timers [02:57:21] if you can do it without googling you get a medal [02:57:52] I am trying to think if there would be any fancy and general way to revert changes in an easy way- a clearer event-driven system? An append only storage model? [02:59:13] RECOVERY - SSH on mw2254.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:59:21] ori: "sudo systemctl start .."? :) [02:59:57] something I want to try is system versioning tables: https://mariadb.com/kb/en/system-versioned-tables/ but I suspect a db will explode if we use that on a large production db [03:01:11] BLOCKCHAIN [03:01:32] !log started refreshLinks --dfn-only via systemd units for s2-s6 (T299244) [03:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:01:36] T299244: {{PAGESINCATEGORY:Wikipedia:Nuweg}} not decreased when page is deleted - https://phabricator.wikimedia.org/T299244 [03:01:52] I'm watching https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&refresh=1m&var-site=eqiad&var-group=core&var-shard=s1&var-shard=s2&var-role=All [03:01:55] I mention it because I think Tim asked for a delayed replica recently, and that could provide a "continuous delayed replica" [03:02:25] there's a rows read spike but that could just be normal traffic(?) 
[03:02:26] the problem, as usual, is how to integrate old and new data for a recover- that I think is the biggest limitation of db recoveries [03:04:03] in theory *links tables all secondary/derivative data, but I get the feeling it would take much longer to reparse every page to rebuild it vs restore from a backup [03:04:37] Can you imagine re-building the Wikidata links tables from scratch? [03:05:07] the problem with a recovery is that by the time you do it, new data has been added, and it is non trivial to solve that issue [03:05:14] Yeah. [03:05:26] !log started refreshLinks --dfn-only via systemd units for s7-s8 (T299244) [03:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:42] We don't want to have the 3-day-downtime whilst we manually reconcile the binlogs of the DBs from back in the Tampa days again. [03:05:58] at least with the current model, that is way maybe a future simpler model could do "automatic merges", but that is is the far future [03:06:14] *there [03:06:30] legoktm: OK, can I slope off or do you need a second pair of eyes around still? [03:06:36] nope, go [03:06:42] I was about to bail in a few minutes, just updating the task [03:06:43] Awesome. Thanks for all your help. [03:06:46] Ack. [03:06:47] <3 [03:06:55] this was fun, let's not do it again for a few months [03:07:01] thanks to everyone that helped! 
[03:07:03] s/months/years/ [03:07:09] * ori waves [03:09:11] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:18:33] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:28:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [03:33:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [05:20:39] !log started recountCategories.php --wiki=enwiki --mode pages (T299244) [05:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:20:43] T299244: {{PAGESINCATEGORY:Wikipedia:Nuweg}} not decreased when page is deleted - https://phabricator.wikimedia.org/T299244 [05:20:59] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:26:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [05:27:30] CAT:CSD looks good on enwp [05:34:19] 10SRE, 10Language-Team (Language-2022-January-March): Deploy Flores MT secrets in Production for ContentTranslation - https://phabricator.wikimedia.org/T299023 (10KartikMistry) >>! In T299023#7623070, @Dzahn wrote: > Of course using GPG is fine as well. I just did not suggest it because usually people consider... 
[05:36:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [05:53:49] PROBLEM - SSH on restbase2011.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:13:23] test [06:14:46] !log running recountCategories on s3 wikis [06:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:43] !log finished running recountCategories on s8 wikis (T299244) [06:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:47] T299244: Deleted pages are not being removed fron links tables, which also messes up category counts - https://phabricator.wikimedia.org/T299244 [06:19:07] ... [06:19:11] I didn't even get to log that I started it [06:19:38] !log finished running recountCategories on s5 wikis (T299244) [06:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:35] I guess the were too much [06:21:46] !log finished running recountCategories on s6 wikis (T299244) [06:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:50] !log finished running recountCategories on s3 wikis (T299244) [06:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:54] T299244: Deleted pages are not being removed fron links tables, which also messes up category counts - https://phabricator.wikimedia.org/T299244 [07:27:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [07:32:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [07:51:28] !log legoktm finished running recountCategories on s2 wikis (T299244) 
[07:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:33] T299244: Deleted pages are not being removed fron links tables, which also messes up category counts - https://phabricator.wikimedia.org/T299244 [07:56:15] RECOVERY - SSH on restbase2011.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:58:57] !log legoktm finished running recountCategories on s7 wikis (T299244) [07:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:01] T299244: Deleted pages are not being removed fron links tables, which also messes up category counts - https://phabricator.wikimedia.org/T299244 [07:59:44] ^ that's the last script, everything in the DB should be back to normal now [08:00:25] oh wait, I missed s4 [08:01:24] started [08:02:23] PROBLEM - SSH on restbase2010.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:51:18] (03PS3) 10Giuseppe Lavagetto: build: add mypy types [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/747104 (owner: 10Hashar) [08:51:53] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [08:51:53] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [08:52:34] on my phone [08:52:43] looks like equinix in ulsfo [08:55:37] I'm here too, checking [08:55:53] !log legoktm finished running recountCategories on s4 wikis (T299244) [08:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:57] T299244: Deleted pages are not being removed fron links tables, which also messes up category counts - https://phabricator.wikimedia.org/T299244 [08:56:56] still trying to figure out to which network it is 
[08:57:38] XioNoX: anything I can do to assist ? [08:59:11] * akosiaris around [08:59:27] how can I help? what do we know already? [08:59:27] not sure why it's not showing up in netflow [08:59:57] if you can help grepping logs [09:00:21] to figure out what requests, most likely upload are causing the spike [09:00:36] yeah it is upload, checking logs [09:00:41] and filter out their IPs or UA [09:00:53] I can't do that from my phone [09:01:04] ok, will do. [09:01:13] godog: is it just ulsfo upload? or more ? [09:01:40] akosiaris: good question, afaict ulsfo upload for now https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-site=ulsfo&var-cluster=cache_upload&var-instance=All&var-datasource=thanos [09:02:05] akosiaris: it looks like ulsfo only, and kinda only upload can cause such big spike [09:04:43] yeah, double checked it as well. it's ulsfo only. even codfw isn't seeing much. [09:07:03] still no joy, but still looking [09:12:40] same here [09:19:21] (03PS1) 10Filippo Giunchedi: varnish: temp ban Python-urllib/3.8 [puppet] - 10https://gerrit.wikimedia.org/r/754060 [09:27:40] (03CR) 10Filippo Giunchedi: [C: 03+2] varnish: temp ban Python-urllib/3.8 [puppet] - 10https://gerrit.wikimedia.org/r/754060 (owner: 10Filippo Giunchedi) [09:31:53] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [09:31:53] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [10:25:30] (03PS10) 10Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for I9b40319d374143668a2666b42f59a3799d041afc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T298308) [10:25:55] (03CR) 10Winston Sung: [C: 03+1] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T298308) 
(owner: 10Winston Sung) [11:00:19] PROBLEM - SSH on restbase2011.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:06:21] RECOVERY - SSH on restbase2010.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:01:35] RECOVERY - SSH on restbase2011.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:03:16] (03PS1) 10Jelto: gitlab: update cloud hiera, refactor naming [puppet] - 10https://gerrit.wikimedia.org/r/754063 (https://phabricator.wikimedia.org/T297411) [12:08:39] (03PS2) 10Jelto: gitlab: update cloud hiera, refactor naming [puppet] - 10https://gerrit.wikimedia.org/r/754063 (https://phabricator.wikimedia.org/T297411) [12:11:02] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33265/console" [puppet] - 10https://gerrit.wikimedia.org/r/754063 (https://phabricator.wikimedia.org/T297411) (owner: 10Jelto) [12:18:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [12:30:21] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:38:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [13:30:06] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:55:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:56:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:09:57] PROBLEM - SSH on restbase2010.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:11:15] RECOVERY - SSH on restbase2010.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:32:37] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:45:07] (03CR) 10Giuseppe Lavagetto: [C: 03+2] build: add mypy types [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/747104 (owner: 10Hashar) [16:46:38] (03Merged) 10jenkins-bot: build: add mypy types [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/747104 (owner: 10Hashar) [17:00:03] (03CR) 10Giuseppe Lavagetto: "I like the idea of throwing errors on undefined properties, but I would probably remove the changes to seed_image, as I plan to remove it " [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/747060 (https://phabricator.wikimedia.org/T297619) (owner: 10Hashar) [17:00:08] (03CR) 10Giuseppe Lavagetto: [C: 04-1] Be strict on undefined variables such as seed_image [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/747060 (https://phabricator.wikimedia.org/T297619) (owner: 10Hashar) [18:57:59] PROBLEM - Disk space on ml-etcd2002 is CRITICAL: DISK CRITICAL - free space: / 717 MB (3% inode=95%): /tmp 717 MB (3% inode=95%): /var/tmp 717 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ml-etcd2002&var-datasource=codfw+prometheus/ops [19:00:45] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:11:33] PROBLEM - SSH on restbase2011.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:03:25] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:12:51] RECOVERY - SSH on restbase2011.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:42:13] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:28:07] PROBLEM - SSH on ms-fe2008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:59:43] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 48.48 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1