[00:00:15] RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:00:59] I skimmed https://www.mediawiki.org/wiki/MediaWiki_1.38/wmf.17, seems fine [00:01:00] legoktm: thanks for the update. i see twentyafterfour stepped in as well. <3 [00:01:55] I feel comfortable that the revert is likely fine, but I haven't been paying attention for the past two weeks, so I'm more at 75% confidence [00:02:32] That's probably sufficient. [00:02:48] alright. let's deploy the reverts. if anything else goes wrong or it's not fixed, let's roll the train back and wait until tuesday [00:02:54] WFM. [00:02:57] i'm around as well. i am... not super plugged into what's been going on, but i can watch logs for a bit. [00:03:00] dduvall: Are you doing or should I? [00:03:06] James_F: go for it :) [00:03:10] Whee. [00:03:20] (03CR) 10Jforrester: [C: 03+2] "Emergency deploying." [core] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754046 (https://phabricator.wikimedia.org/T299244) (owner: 10Legoktm) [00:04:18] (There goes my personal 'helping other teams' budget for the week, oh well.) [00:04:20] sounds good, while jenkins runs I'll grab some snacks [00:05:05] and I'll take it from you legoktm :P [00:05:44] This is a pretty broad commit set. Is scapping it going to actually work, or will it choke on the canaries? [00:06:31] If not, should we first rollback the train from group2, then deploy it, then re-roll? [00:06:34] (Gah.) [00:06:55] I think if you sync includes/deferred/LinksUpdate.php first, and then sync all it'll cut down on most errors [00:06:56] James_F: it changes namespacing of class AFAICS. It will very likely throw a lot on canaries [00:07:19] oh, bleh [00:07:23] legoktm: Won't it clash in the autoloader? [00:07:24] Yeah. [00:07:40] dduvall: Rollback of the train is now a single command, right? 
[00:07:40] well, at least it won't be trying to load a deleted file? :) [00:07:41] (03CR) 10Krinkle: [C: 03+1] Revert "LinksUpdate refactor" and follow-ups [core] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754046 (https://phabricator.wikimedia.org/T299244) (owner: 10Legoktm) [00:08:08] personally I'd rather we swallow the exceptions rather than mess with moving the train back and forth [00:08:24] most of this runs during the job queue so it won't be as user facing and any failures will be retried [00:08:25] A force-deploy scap of mediawiki/includes? [00:08:44] But the job-injection code will also throw errors. [00:09:09] it's not a whole lot of work to rollback the train. two commands [00:09:13] if you want me to do that [00:09:14] ack. [00:09:18] legoktm, no need to unnecessary shortcuts. [00:09:21] what dduvall said. [00:09:27] dduvall: Could you? [00:09:30] job inject is in deferred updates afaik [00:09:33] sure thing [00:10:08] James_F: ugh, right. [00:10:20] legoktm: Isn't everything terrible? [00:10:21] Yeah, I guess moving back will lessen the impact [00:10:29] All the way to group0? [00:10:30] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/754047/ was just pushed to gerrit, claiming to fix the bug [00:10:44] was just going to ask. rollback group1 as well? [00:11:07] dduvall: … yeah, let's. [00:11:28] legoktm, Krinkle see taavi's message there ^ [00:11:31] OTOH, if we're temporarily rolling back anyway, do we also want to deploy this and re-roll, or should we just leave it? [00:12:00] taavi: I suspect Umherirrender's analysis and patch is correct, but I'm not super comfortable leaving the LinksUpdate code around even with that given the amount of regressions we've seen so far [00:12:00] Umherirrender's patch makes sense to me but I can't test locally. Can someone? [00:12:01] I'd rather give Tim time to look over the proposed fix and some of the other aspects in the task [00:12:09] +1 [00:12:10] ^^ [00:12:24] I'm for rollback just the refactor. 
It's a drop-in replacement and forward/back compat, should be fine [00:12:49] legoktm: makes sense [00:13:30] as for atomic deploy, I think it's unfortunate that we still haven't made scap use git fetch/checkout, nor have we enabled fpm graceful restarts with revalidation disabled, which indeed means deploying anything non-trivial is always ungraceful even without --force. [00:14:11] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: Revert "all/group1 wikis to 1.38.0-wmf.17" [00:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:23] yeah, sigh. `scap deploy` uses git but sadly we never migrated mediawiki to use it [00:15:02] Obviously mw-on-k8s will make things much better, but with a few months' work we could have got from bad to good before we get to great. Oh well. [00:15:11] yep [00:15:31] perfect is often the enemy of good [00:15:40] or better [00:15:43] There's no logstash key for the cl_from=0 right? It's just silently failing to update anything meaningful. [00:16:12] (Perhaps the DB layer should throw when the page ID pointer is a nonsense value, but that's another thing for the post-incident review.) [00:16:37] k. James_F, legoktm: group1/group2 are on wmf.16 now [00:16:40] Ack. [00:16:42] Thanks dduvall [00:17:01] * James_F glares at zuul. [00:17:02] I see there have been no config patches since the wmf.17 promotion. If that hadn't been the case, I'd say it's more risky to use the train as a clever trick to avoid errors. I'd say --force would have been normal, not a shortcut, given that "normal" means something that always works, and using the train doesn't always work, but in this case it seems like the best possible option indeed :) [00:17:22] I'm offline for the night, thanks everyone [00:17:28] thanks for all your help taavi [00:17:37] gn8 taavi [00:17:39] o/ [00:17:50] are we anticipating moving back to group1/group2 imminently?
if so, i'll leave the reverts in /srv/mediawiki-staging and discard when it's time to move back [00:18:04] dduvall: Yes. [00:18:07] k [00:18:10] I think as soon as James_F syncs it we should move forward again [00:18:15] ack [00:18:16] Well. [00:18:23] First we might want to check that things work on group0. ;-) [00:18:25] with some breathing room :) [00:18:29] * James_F grins. [00:22:41] OK, we're in the final furlong of the CI run. [00:23:53] (03Merged) 10jenkins-bot: Revert "LinksUpdate refactor" and follow-ups [core] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754046 (https://phabricator.wikimedia.org/T299244) (owner: 10Legoktm) [00:26:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [00:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:27:01] (03PS1) 10Catrope: doc.wikimedia.org CSP: Allow XHR requests to Wikipedia and Wikidata [puppet] - 10https://gerrit.wikimedia.org/r/754048 (https://phabricator.wikimedia.org/T285570) [00:27:09] Krinkle, noted reg. config & shortcuts. thanks for the explanation! [00:27:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [00:27:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [00:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:27:33] (03CR) 10Catrope: [C: 04-1] "Please do not merge without review and approval from the Security team" [puppet] - 10https://gerrit.wikimedia.org/r/754048 (https://phabricator.wikimedia.org/T285570) (owner: 10Catrope) [00:27:52] I suppose we can test using mw-on-k8s now :) [00:28:30] OK, this is now live on mwdebug1002. [00:28:35] Anyone else is welcome to test too. 
[00:28:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [00:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:28:56] (But only on group0.) [00:29:29] Of course the jobs won't go through the debug server, but at least the triggering of them can be tested. [00:30:36] on test.wp, I created [[What even]] with contents [[Category:What]], saw it appear, then deleted the page and saw it disappear from the category [00:31:24] Hmmmmm. [00:31:46] I just deleted a page there and on deletion I got the message "The page or file ‘Why’ could not be deleted. It may have already been deleted by someone else." with a deletion log of me deleting it. [00:31:55] Maybe I accidentally double-clicked delete? [00:32:02] links update, and deletedlinksupdate run in deferred post-send as first attempt [00:32:06] become jobs if they fail. [00:32:13] (+ cascading updates > jobs) [00:32:15] Krinkle: Oh, right, that explains them working on testwiki. [00:32:16] Ack. [00:32:31] and {{PAGESINCATEGORY:...}} has the correct value [00:33:20] Not right now it doesn't. [00:33:34] Pages in category: 0 | Pages in category ‘What’ This category contains only the following page. W Why not [00:33:48] Eurgh. [00:33:59] MW-SNAFU of category counting? [00:34:31] hrm [00:35:08] no, I screwed up the syntax [00:35:25] it's "PAGESINCATEGORY:What" not "...Category:What" [00:35:40] Ha. [00:36:04] Yeah, I got 'The page or file ‘Why not’ could not be deleted. It may have already been deleted by someone else.' again. [00:36:06] But it deleted. [00:36:19] And the category count and membership updated correctly. [00:36:29] OK, let's sync this. Agreed, legoktm, Krinkle? [00:36:33] +1 [00:36:37] Ack. [00:37:02] Should I just do a real sync-world? 
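[Editor's note] Krinkle's explanation above at [00:32:02]–[00:32:13] — links updates run post-send as deferred updates on the first attempt and only become jobs (which get retried) if that attempt fails — can be sketched roughly as follows. This is a hypothetical Python illustration of the pattern, not MediaWiki's actual DeferredUpdates/JobQueue code; the names `run_post_send` and `links_update` are invented for the sketch:

```python
from collections import deque

# Stand-in for the job queue; failed deferred updates land here for retry.
job_queue = deque()
applied = []

def run_post_send(update, *args):
    """First attempt: run the update post-send; on failure, enqueue as a job."""
    try:
        update(*args)
    except Exception:
        job_queue.append((update, args))  # a job runner would retry this later

def links_update(page_id):
    if page_id == 0:
        raise ValueError("bogus page id")  # cf. the cl_from=0 rows discussed above
    applied.append(page_id)

run_post_send(links_update, 42)  # succeeds on the first, post-send attempt
run_post_send(links_update, 0)   # fails, falls back to the job queue
print(applied)          # [42]
print(len(job_queue))   # 1
```

This is also why the breakage was partly visible on testwiki despite the job runners not going through the debug server: the first attempt happens in the web request's post-send phase.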
[00:37:21] (03PS1) 10Dduvall: Revert "all wikis to 1.38.0-wmf.17 refs T293958" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754049 [00:37:23] (03CR) 10Dduvall: [C: 03+2] Revert "all wikis to 1.38.0-wmf.17 refs T293958" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754049 (owner: 10Dduvall) [00:37:25] (03PS1) 10Dduvall: Revert "group1 wikis to 1.38.0-wmf.17 refs T293958" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754050 [00:37:27] (03CR) 10Dduvall: [C: 03+2] Revert "group1 wikis to 1.38.0-wmf.17 refs T293958" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754050 (owner: 10Dduvall) [00:37:29] don't mind ^ [00:37:29] dduvall: Wait, not yet. [00:37:38] just the reverts [00:37:40] Oh, right, that's just clean-up, never mind. :-) [00:37:41] Yeah. [00:38:01] i realized that if we're going back just to group1 i'd have to fiddle with the git HEAD and i decided not to do that [00:38:09] (03Merged) 10jenkins-bot: Revert "all wikis to 1.38.0-wmf.17 refs T293958" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754049 (owner: 10Dduvall) [00:38:13] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.38.0-wmf.17 refs T293958" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754050 (owner: 10Dduvall) [00:38:22] * James_F nods. [00:38:50] dduvall: Do you think a scap sync-world is better than syncing includes/ and then the autoloader and then other things and hoping the order is roughly right? [00:39:06] (My kingdom for atomic deploys.) [00:39:51] without knowing the details here, i can't really say [00:40:06] * James_F nods. [00:40:08] my guess is that either is ok since we're still on group0 [00:40:13] True. [00:40:17] Let's do the safe thing then. [00:40:35] I'd personally try sync-file and if it fails do sync-world [00:41:06] I'd be nervous of that getting into a messy situation. [00:41:18] I guess we could sync-file --force ? [00:41:23] But I'd really rather avoid. 
[00:41:59] sync-world shouldn't be _that_ slow with everything already out [00:41:59] I don't think it would be that messy, but up to you [00:42:39] !log jforrester@deploy1002 Started scap: Revert "LinksUpdate refactor" and follow-ups for T299244 re. T293958 [00:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:42:44] T299244: {{PAGESINCATEGORY:Wikipedia:Nuweg}} not decreased when page is deleted - https://phabricator.wikimedia.org/T299244 [00:42:44] T293958: 1.38.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T293958 [00:42:45] Going with a full scap. [00:43:56] btw, what makes you suggest --force? [00:44:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [00:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:42] dancy: because sync isn't atomic and there's class name changes and autoloader changes etc. [00:44:46] dancy: If the autoloader is referring to a file and then we sync the includes/ directory such that some requests will hit a file that no longer exists, I've run into the state of the initial sync passing the canaries but subsequent ones always failing as jobs get retried. [00:44:59] dancy: Aka this is why I would drink if I did. ;-) [00:45:15] so aspect of --force are you looking for? Unconditional php-fpm restart? [00:45:19] *What aspect [00:45:22] * Krinkle looks at a bottle of "stroomwafel liqour" on his kitchen counter [00:45:32] dancy: bypassing canary failure i believe [00:45:34] The "ignore what the canaries say, I really need this change synced to the machines". [00:45:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [00:45:49] Ah, I didn't realize the canaries were problematic. 
[00:45:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [00:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:45:57] (and no James_F , I didn't buy it. I rarely drink.) [00:45:58] OK, the scap sync-world has done everything except the cdb-rebuild. [00:45:59] Good times. [00:46:37] !log jforrester@deploy1002 Finished scap: Revert "LinksUpdate refactor" and follow-ups for T299244 re. T293958 (duration: 03m 58s) [00:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:46:44] And we're done. Fastest serious sync-world in history? [00:46:51] OK, time to test again. [00:46:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [00:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:47:04] wow, fast indeed. [00:47:13] Krinkle: omg, it's actually a thing: https://www.totalwine.com/spirits/liqueurscordialsschnapps/chocolate-sweets-candy/caramel/van-meers-stroopwafel-liqueur/ [00:47:18] And much less risk of James having a heart attack. [00:47:19] yeah, it's really not that bad. we've talked about making everyone use it all the time :) [00:47:21] I have spent some time shaving seconds of sync-world time over the last few montsh [00:47:24] *months [00:47:27] ori, Krinkle: Good grief. [00:47:30] ori: that's the one [00:47:33] RECOVERY - SSH on restbase2011.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:47:38] dancy: You sir are a gent amongst humans. [00:47:41] that's right. thanks to dancy [00:48:07] Krinkle: that sounds a little too delicious [00:48:19] like, i'd need to clear my calendar [00:48:20] of course, soon we'll turn php-fpm restart on and all syncs will take 3 minutes. 
Enjoy it while it lasts. [00:48:35] :-) [00:48:37] I restored and re-deleted [[What even]] and the category page updated properly, showing it and then it disappearing [00:48:48] \o/ [00:48:49] Ack, LGTM. [00:48:55] OK, dduvall, want to re-roll the train? [00:48:59] yep yep [00:49:03] Whee. [00:49:03] group1 here we go [00:49:05] whew! [00:49:14] Nice work everyone [00:49:23] I could only watch but it was an adventure [00:49:23] (03PS1) 10Dduvall: group1 wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754051 [00:49:26] (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754051 (owner: 10Dduvall) [00:49:44] but also, with fpm restarts, we'll have practically atomic deploys (short of a previous deploy never having compiled one of the files changing in the next deploy and first reading it off disk after the sync before the restart) [00:49:57] It's lovely how the wind ominously whistles around the eaves of my building, as the temperature falls towards -10C, in the darkest of nights, but this feels like a success anyway. [00:50:05] (03Merged) 10jenkins-bot: group1 wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754051 (owner: 10Dduvall) [00:50:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:51:19] ori: not even a "but when I drink, it's liquid stroopwafels". It'd make for a fun story, but no, no liquid stroopwafels, liquid cheese or other liquid forms of dutch food ite
actually, no I'd have liquid licorice and in fact have had that with alcohol, it's pretty good. [00:51:39] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.17 refs T293958 [00:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:51:42] T293958: 1.38.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T293958 [00:51:46] James_F: from this, i gather you made it safely to new york. :) [00:52:06] brennen: Oh, yes, hello from NYC to you too. [00:52:16] (03CR) 10Eevans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/753942 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [00:52:32] !log dduvall@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.17 refs T293958 (duration: 00m 52s) [00:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:53:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:53:24] https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/refs/heads/master/maintenance/recountCategories.php [00:53:46] legoktm: You think we should run it on enwiki's CSD cats? [00:54:21] just everywhere I think [00:54:31] all looks well after group1 promotion. do y'all want to verify on a group1 wiki as well or should i take us to all wikis? [00:55:01] I don't think I have admin privs on any group1 wiki, so I can't help with that. Don't really think it's required though [00:55:13] all looks well === no troubling errors [00:55:26] alright. 
all wikis here we go [00:55:39] (03PS1) 10Dduvall: all wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754053 [00:55:41] (03CR) 10Dduvall: [C: 03+2] all wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754053 (owner: 10Dduvall) [00:55:42] James_F: probably hack it to run on the categories we care about (CSD, etc.) first and then let it run in the background everywhere [00:56:24] (03Merged) 10jenkins-bot: all wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754053 (owner: 10Dduvall) [00:56:49] Yeah… [00:57:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [00:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:57:59] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.17 refs T293958 [00:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:58:02] T293958: 1.38.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T293958 [00:58:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [00:58:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [00:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:59:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [00:59:24] legoktm, do you need me to test something on commons? [00:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:59:42] AntiComposite: We're live everywhere now, so no, enwiki should be "working" again (but the counts will be wrong). 
[01:00:30] (03CR) 10Ottomata: kafka: add check to test the Broker's TLS port (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/753738 (owner: 10Elukey) [01:01:00] (03CR) 10Ottomata: [C: 03+2] Absent network_flows_internal druid jobs [puppet] - 10https://gerrit.wikimedia.org/r/753818 (https://phabricator.wikimedia.org/T263277) (owner: 10Joal) [01:01:12] hm, the recount script seems pretty fast [01:01:54] https://phabricator.wikimedia.org/P18738 [01:01:59] legoktm: Does it work though? [01:02:14] I have no idea :< [01:02:21] Meh. [01:03:26] "The script runs reasonably quickly on all but the very largest wikis." [01:03:30] thanks tto :D [01:03:42] And yet 'the very largest wikis' are where this is most needed. [01:04:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [01:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:04:51] James_F, legoktm: thanks a ton for fixing things. do you need me to stick around? [01:04:53] !log starting recountCategories.php --mode pages --wiki enwiki on mwmaint1002 [01:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:05] dduvall: I think we're set, thank you :)) [01:05:26] dduvall: Thanks! [01:05:42] right on :) i will go join the family for dinner then. break a leg! [01:05:52] dduvall: See you around. [01:06:28] max(cat_id) on enwp is 248,681,592 [01:06:44] it's already at 222M [01:07:20] is it actually doing anything legoktm ? Sounds too fast to be true to me :) [01:08:03] If something seems too good to be true it probably is, indeed. 
[01:08:53] it finished [01:08:58] counts on https://en.wikipedia.org/wiki/Category:Candidates_for_speedy_deletion are still off [01:09:43] running with --mode subcats and then --mode files just in case those are the remaining issues [01:10:45] nope [01:10:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [01:10:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [01:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:12:07] legoktm: I guess we'll need to run the real script instead? [01:12:21] I think I got it, the categorylinks table has bogus entries still [01:12:30] so until that's cleared, the recount script is just...counting them still [01:13:10] Anything with cl_from=0 is definitely wrong and can just be DELETE FROM'ed, right? [01:13:54] kind of [01:14:09] https://phabricator.wikimedia.org/P18739 [01:14:35] Ah, right, page_id is the issue of course. [01:14:37] Hmm. [01:14:52] so something like DELETE FROM categorylinks LEFT JOIN page ON cl_from=page_id WHERE page_id IS NULL; [01:16:00] no one else complained about other links tables right, just categorylinks? [01:16:24] Umher's comment suggested other tables might be wrong too. [01:17:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [01:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:17:06] ah right [01:17:09] ok [01:18:15] testwp has no page_id is NULL so I'm testing my deletion query on mw.o [01:19:50] https://www.mediawiki.org/wiki/User:Legoktm/sandbox2 shows the wrong counts [01:19:51] If page_id is NULL and you're joining on cl_from=page_id then won't it just be cl_from=null? [01:20:00] it's a left join [01:20:07] Oh duh. [01:20:17] Clearly it's too late for me to be doing this. 
;_) [01:20:21] so I'm going to try, on mw.o [01:20:40] categoryUpdate.php? [01:21:02] DELETE FROM categorylinks LEFT JOIN page ON cl_from=page_id WHERE page_id IS NULL AND cl_to="Candidates_for_deletion" LIMIT 3; [01:21:33] the cl_to= and LIMIT clauses to limit damage if I messed up [01:21:36] That should work. [01:21:38] Yes. [01:21:55] Given we can find them again via WLH on the templates it's not terminal. [01:22:03] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=PUT https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [01:22:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [01:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:22:18] syntax error [01:23:02] Don't you have to do the join as an inner select and then DELETE FROM WHERE in that? [01:23:03] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [01:23:26] https://stackoverflow.com/questions/2763206/deleting-rows-with-mysql-left-join [01:23:34] but now it doesn't like my LIMIT [01:23:40] wikiadmin@10.64.0.44(mediawikiwiki)> DELETE categorylinks FROM categorylinks LEFT JOIN page ON cl_from=page_id WHERE page_id IS NULL AND cl_to="Candidates_for_deletion" LIMIT 3; [01:23:40] ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'LIMIT 3' at line 1 [01:24:42] Do a SELECT FROM first? [01:25:18] hm [01:25:46] > The LIMIT clause places a limit on the number of rows that can be deleted. These clauses apply to single-table deletes, but not multi-table deletes. 
[01:25:49] from https://dev.mysql.com/doc/refman/5.6/en/delete.html [01:25:54] Helpful. [01:26:17] yeah let me rewrite it into a subquery [01:27:50] DELETE FROM categorylinks WHERE cl_from IN (SELECT cl_from FROM categorylinks LEFT JOIN page ON cl_from=page_id WHERE page_id IS NULL AND cl_to="Candidates_for_deletion" LIMIT 3); [01:28:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [01:28:30] The inner SELECT returns nothing for me on enwiki. [01:28:35] mediawikiwiki [01:28:50] Ack, that works. [01:28:53] it should return 3 rows there [01:28:57] It does. [01:29:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [01:29:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [01:29:17] ffs [01:29:39] "This version of MariaDB doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'" [01:30:21] Well that's just unhelpful. [01:30:35] It's only MW.org; if it breaks the world it doesn't matter. [01:30:51] well it means this approach is just broken [01:31:01] because we need the LIMIT otherwise it won't be replag safe on enwp [01:31:13] might as well just write a proper maint script that selects and then deletes [01:31:26] Yeah. [01:32:12] give me a bit of time [01:33:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [01:35:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [01:40:28] James_F: https://phabricator.wikimedia.org/P18740 [01:41:48] sorry, gotta take care of something IRL, brb in ~20 [01:41:49] Should the delete be in the same batch or its own loop? Should we add a --dry-run option to just print out what'll get deleted? [01:41:59] Otherwise LGTM. 
[01:47:06] I don't think this works [01:47:14] $id isn't set anywhere [01:47:50] you forgot a foreach [02:12:00] uh [02:12:06] I think it was supposed to be $toDelete [02:13:07] I'll add a --dry-run [02:14:44] updated https://phabricator.wikimedia.org/P18740 James_F, ori [02:15:51] one more update, dropped beginTransaction(), will let that happen implicitly [02:16:05] I'm not sure this works either. Won't it loop forever if there are more than batch size rows? [02:16:19] the select will keep selecting the same rows [02:16:47] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:16:50] I like James_F's idea of separating the two [02:17:21] loop forever in dry run mode that is [02:17:47] err, right [02:18:01] that means we have to hold all the IDs in memory, right? [02:18:17] I can just add a tracker for cl_to and add a cl_to >= $last condition [02:18:28] how many are there? [02:19:36] checking... [02:19:55] anyways [02:19:58] er [02:20:05] RAM won't be an issue unless there are billions [02:20:13] fair [02:21:09] I have a question, local site notices are not seen on mobile on bnwiki. What is the reason for this? [02:21:11] is there a point in batching the select then? [02:23:07] my plain select against enwp is still running [02:23:19] yeah ok, batching makes sense then [02:23:34] MdsShakil: I'm not sure, but you might have better luck in #wikimedia-tech or asking on https://meta.wikimedia.org/wiki/Tech [02:24:14] in that case maybe have the dry run mode bail after one iteration [02:24:20] given that this is a one-off script [02:24:32] and just verify that the IDs it's outputting look sane [02:24:35] Also that. 
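[Editor's note] The select-then-delete approach under review here (the P18740 script itself is not quoted in the log) can be sketched as below — a minimal, hypothetical Python/sqlite3 version with made-up table contents, not the actual maintenance script. Materializing the batch of IDs client-side sidesteps MariaDB's "LIMIT & IN/ALL/ANY/SOME subquery" restriction, and dry-run mode bails after one batch so it can't loop forever, as agreed above:

```python
import sqlite3

def delete_orphaned_categorylinks(conn, batch_size=3, dry_run=False):
    """Delete categorylinks rows whose cl_from points at no existing page."""
    total = 0
    while True:
        # Materialize the next batch of orphaned IDs client-side, avoiding
        # the unsupported LIMIT-inside-IN-subquery form.
        ids = [r[0] for r in conn.execute(
            "SELECT cl_from FROM categorylinks "
            "LEFT JOIN page ON cl_from = page_id "
            "WHERE page_id IS NULL LIMIT ?", (batch_size,))]
        if not ids:
            break
        if dry_run:
            print("would delete:", ids)
            return total  # bail after one batch; just eyeball the IDs
        marks = ",".join("?" * len(ids))
        conn.execute(
            f"DELETE FROM categorylinks WHERE cl_from IN ({marks})", ids)
        conn.commit()  # production would wait for replication between batches
        total += len(ids)
    return total

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT)")
conn.executemany("INSERT INTO page VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO categorylinks VALUES (?, ?)", [
    (1, "Kept"), (2, "Kept"),                      # live pages: keep
    (7, "Candidates_for_deletion"), (8, "What"),   # orphaned: delete
    (9, "Candidates_for_deletion"),                # orphaned: delete
])
deleted = delete_orphaned_categorylinks(conn, batch_size=2)
print(deleted)  # 3
```

Deleting by plain `cl_from IN (...)` is safe here precisely because the page no longer exists, so every categorylinks row for that ID is garbage.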
[02:24:54] ok [02:27:25] https://phabricator.wikimedia.org/P18741 [02:28:39] checking with ?curid= shows that all those page ids don't exist anymore [02:28:46] (really weird error message too, but that's another issue) [02:29:41] also my select even with LIMIT 500 on enwp is still going [02:29:59] it's scanning uh, 168M rows [02:30:14] | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | [02:30:14] +------+-------------+---------------+--------+---------------+--------------+---------+------------------------------+-----------+--------------------------------------+ [02:30:14] | 1 | SIMPLE | categorylinks | index | NULL | cl_timestamp | 261 | NULL | 168681299 | Using index | [02:30:14] | 1 | SIMPLE | page | eq_ref | PRIMARY | PRIMARY | 4 | enwiki.categorylinks.cl_from | 1 | Using where; Using index; Not exists | [02:31:02] I think the where cl_from >= will speed it up... [02:31:16] down to 87M rows [02:31:30] (via explain) [02:32:29] RECOVERY - ElasticSearch unassigned shard check - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [02:33:57] I'm not sure there's any faster way to do it [02:34:05] Ah well. [02:34:05] the commons categorylinks table is even bigger [02:34:45] Does Commons need the fix? [02:34:50] Edit rate is very much lower. [02:35:37] I wonder if we go in reverse order it'll be faster [02:35:55] because deleted pages are more likely to be recently created [02:36:01] yeah, c:CAT:CSD has miscounts [02:36:40] (03PS1) 10Scardenasmolinar: Change TheWikipediaLibrary editcount [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754054 (https://phabricator.wikimedia.org/T288070) [02:37:36] ok, going backwards returns results faster [02:38:13] my LIMIT 500 query on Commons returned 500 results [02:38:37] oops [02:38:51] 18:29:40 also my select even with LIMIT 500 on enwp is still going <-- this was wrong, ignore [02:39:08] Hmm. 
[02:39:26] it took ~1m to return the first 500 results [02:39:33] is scanning 84m rows a lot? [02:39:41] I think it's going to perform ok [02:39:43] no, I screwed up [02:39:44] yeah it's fine [02:39:59] I did SELECT COUNT(*) ... LIMIT 500, which is useless [02:40:22] with the cl_from >= it's fine [02:40:56] :-) [02:41:00] so I think we clean up categorylinks first and then extend the script to all the other links tables [02:41:09] WFM. [02:41:40] well, clean up categorylinks, re-run recountCategories, then the other links tables [02:41:53] * James_F nods. [02:42:10] ok, please review https://phabricator.wikimedia.org/P18740 [02:43:28] ori, James_F ^ [02:44:02] Just wait for replication, no additional sleep? [02:44:16] It should be fine. [02:44:33] yeah I think it should be fine [02:44:41] +2 [02:45:07] LGTM but do you want a DBA on hand in case we reasoned badly? [02:46:09] we're probably 3-4 hours away from a DBA being awake I think [02:46:17] Or Monday. [02:46:30] are you worried about the queries/deletes being expensive or deleting the wrong thing? [02:46:37] "yes" [02:46:39] Well, Amir.1 will be around in a few hours probably. [02:47:09] your script looks correct and I think it's probably safe, but it's the production database [02:47:19] It's secondary data, ultimately. [02:47:33] If we dropped the entire table it'd be bad but recoverable without backups. [02:47:43] (Though editors would be Unhappy™.) [02:47:56] yeah, what you want to have on speed dial is a backup guy, not a dba [02:48:02] !!! [02:48:11] perfect timing jynus :) [02:48:11] jynus: Oops, did we accidentally summon you? [02:48:13] hahaha [02:48:18] I can imagine you just woke up in your sleep [02:48:21] (I am not here, BTW) [02:48:22] "something is not right" [02:48:28] Someone somewhere is doing something bad. [02:48:33] I can feel it in my fingers. 
[02:49:24] legoktm: Umherirrender just suggested refreshLinks.php --dfn-only [02:49:38] we have backups of the *link tables too, so don't worry [02:49:52] amazing, of course it already existed [02:49:57] $this->addOption( 'dfn-only', 'Delete links from nonexistent articles only' ); [02:50:01] Yup. [02:50:10] * ori headdesks [02:50:11] Apparently it's already a cron'ed job? [02:50:40] https://gerrit.wikimedia.org/g/operations/puppet/+/1415150baa54865bc30173de750c1a9f71ca8626/modules/profile/manifests/mediawiki/maintenance/refreshlinks/periodic_job.pp#6 [02:50:42] https://gerrit.wikimedia.org/g/operations/puppet/+/1415150baa54865bc30173de750c1a9f71ca8626/modules/profile/manifests/mediawiki/maintenance/refreshlinks/periodic_job.pp [02:50:44] Snap. [02:50:49] ok, this is totally safe to run then [02:50:58] Yeah, let's JFDI. [02:51:17] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:51:45] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:52:08] !log started mwscript refreshLinks.php --wiki=enwiki --dfn-only [02:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:52:56] I'll do a separate one for commons too [02:52:59] legoktm: You should mention the task in your !logs so that Phab stalkers get some info. [02:53:08] and then the rest can be foreachwikiindblist [02:53:20] Yeah. 
[02:54:14] !log started mwscript refreshLinks.php --wiki=enwiki --dfn-only (T299244) [02:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:54:17] T299244: {{PAGESINCATEGORY:Wikipedia:Nuweg}} not decreased when page is deleted - https://phabricator.wikimedia.org/T299244 [02:54:44] Ta [02:55:10] !log started mwscript refreshLinks.php --wiki=commonswiki --dfn-only (T299244) [02:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:56:56] for the rest I'm just going to start the systemd timers [02:57:21] if you can do it without googling you get a medal [02:57:52] I am trying to think if there would be any fancy and general way to revert changes in an easy way- a clearer event-driven system? An append only storage model? [02:59:13] RECOVERY - SSH on mw2254.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:59:21] ori: "sudo systemctl start .."? :) [02:59:57] something I want to try is system versioning tables: https://mariadb.com/kb/en/system-versioned-tables/ but I suspect a db will explode if we use that on a large production db [03:01:11] BLOCKCHAIN [03:01:32] !log started refreshLinks --dfn-only via systemd units for s2-s6 (T299244) [03:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:01:36] T299244: {{PAGESINCATEGORY:Wikipedia:Nuweg}} not decreased when page is deleted - https://phabricator.wikimedia.org/T299244 [03:01:52] I'm watching https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&refresh=1m&var-site=eqiad&var-group=core&var-shard=s1&var-shard=s2&var-role=All [03:01:55] I mention it because I think Tim asked for a delayed replica recently, and that could provide a "continuous delayed replica" [03:02:25] there's a rows read spike but that could just be normal traffic(?) 
[03:02:26] the problem, as usual, is how to integrate old and new data for a recover- that I think is the biggest limitation of db recoveries [03:04:03] in theory *links tables all secondary/derivative data, but I get the feeling it would take much longer to reparse every page to rebuild it vs restore from a backup [03:04:37] Can you imagine re-building the Wikidata links tables from scratch? [03:05:07] the problem with a recovery is that by the time you do it, new data has been added, and it is non trivial to solve that issue [03:05:14] Yeah. [03:05:26] !log started refreshLinks --dfn-only via systemd units for s7-s8 (T299244) [03:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:42] We don't want to have the 3-day-downtime whilst we manually reconcile the binlogs of the DBs from back in the Tampa days again. [03:05:58] at least with the current model, that is way maybe a future simpler model could do "automatic merges", but that is is the far future [03:06:14] *there [03:06:30] legoktm: OK, can I slope off or do you need a second pair of eyes around still? [03:06:36] nope, go [03:06:42] I was about to bail in a few minutes, just updating the task [03:06:43] Awesome. Thanks for all your help. [03:06:46] Ack. [03:06:47] <3 [03:06:55] this was fun, let's not do it again for a few months [03:07:01] thanks to everyone that helped! 
[03:07:03] s/months/years/ [03:07:09] * ori waves [03:09:11] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:18:33] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:28:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [03:33:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [05:20:39] !log started recountCategories.php --wiki=enwiki --mode pages (T299244) [05:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:20:43] T299244: {{PAGESINCATEGORY:Wikipedia:Nuweg}} not decreased when page is deleted - https://phabricator.wikimedia.org/T299244 [05:20:59] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:26:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [05:27:30] CAT:CSD looks good on enwp [05:34:19] 10SRE, 10Language-Team (Language-2022-January-March): Deploy Flores MT secrets in Production for ContentTranslation - https://phabricator.wikimedia.org/T299023 (10KartikMistry) >>! In T299023#7623070, @Dzahn wrote: > Of course using GPG is fine as well. I just did not suggest it because usually people consider... 
[05:36:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [05:53:49] PROBLEM - SSH on restbase2011.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:13:23] test [06:14:46] !log running recountCategories on s3 wikis [06:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:43] !log finished running recountCategories on s8 wikis (T299244) [06:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:47] T299244: Deleted pages are not being removed fron links tables, which also messes up category counts - https://phabricator.wikimedia.org/T299244 [06:19:07] ... [06:19:11] I didn't even get to log that I started it [06:19:38] !log finished running recountCategories on s5 wikis (T299244) [06:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:35] I guess the were too much [06:21:46] !log finished running recountCategories on s6 wikis (T299244) [06:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:50] !log finished running recountCategories on s3 wikis (T299244) [06:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:54] T299244: Deleted pages are not being removed fron links tables, which also messes up category counts - https://phabricator.wikimedia.org/T299244 [07:27:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [07:32:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [07:51:28] !log legoktm finished running recountCategories on s2 wikis (T299244) 
[07:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:33] T299244: Deleted pages are not being removed fron links tables, which also messes up category counts - https://phabricator.wikimedia.org/T299244 [07:56:15] RECOVERY - SSH on restbase2011.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:58:57] !log legoktm finished running recountCategories on s7 wikis (T299244) [07:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:01] T299244: Deleted pages are not being removed fron links tables, which also messes up category counts - https://phabricator.wikimedia.org/T299244 [07:59:44] ^ that's the last script, everything in the DB should be back to normal now [08:00:25] oh wait, I missed s4 [08:01:24] started [08:02:23] PROBLEM - SSH on restbase2010.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:51:18] (03PS3) 10Giuseppe Lavagetto: build: add mypy types [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/747104 (owner: 10Hashar) [08:51:53] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [08:51:53] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [08:52:34] on my phone [08:52:43] looks like equinix in ulsfo [08:55:37] I'm here too, checking [08:55:53] !log legoktm finished running recountCategories on s4 wikis (T299244) [08:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:57] T299244: Deleted pages are not being removed fron links tables, which also messes up category counts - https://phabricator.wikimedia.org/T299244 [08:56:56] still trying to figure out to which network it is 
[08:57:38] XioNoX: anything I can do to assist ? [08:59:11] * akosiaris around [08:59:27] how can I help? what do we know already? [08:59:27] not sure why it's not showing up in netflow [08:59:57] if you can help grepping logs [09:00:21] to figure out what requests, most likely upload are causing the spike [09:00:36] yeah it is upload, checking logs [09:00:41] and filter out their IPs or UA [09:00:53] I can't do that from my phone [09:01:04] ok, will do. [09:01:13] godog: is it just ulsfo upload? or more ? [09:01:40] akosiaris: good question, afaict ulsfo upload for now https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-site=ulsfo&var-cluster=cache_upload&var-instance=All&var-datasource=thanos [09:02:05] akosiaris: it looks like ulsfo only, and kinda only upload can cause such big spike [09:04:43] yeah, double checked it as well. it's ulsfo only. even codfw isn't seeing much. [09:07:03] still no joy, but still looking [09:12:40] same here [09:19:21] (03PS1) 10Filippo Giunchedi: varnish: temp ban Python-urllib/3.8 [puppet] - 10https://gerrit.wikimedia.org/r/754060 [09:27:40] (03CR) 10Filippo Giunchedi: [C: 03+2] varnish: temp ban Python-urllib/3.8 [puppet] - 10https://gerrit.wikimedia.org/r/754060 (owner: 10Filippo Giunchedi) [09:31:53] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [09:31:53] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [10:25:30] (03PS10) 10Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for I9b40319d374143668a2666b42f59a3799d041afc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T298308) [10:25:55] (03CR) 10Winston Sung: [C: 03+1] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T298308) 
(owner: 10Winston Sung) [11:00:19] PROBLEM - SSH on restbase2011.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:06:21] RECOVERY - SSH on restbase2010.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:01:35] RECOVERY - SSH on restbase2011.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:03:16] (03PS1) 10Jelto: gitlab: update cloud hiera, refactor naming [puppet] - 10https://gerrit.wikimedia.org/r/754063 (https://phabricator.wikimedia.org/T297411) [12:08:39] (03PS2) 10Jelto: gitlab: update cloud hiera, refactor naming [puppet] - 10https://gerrit.wikimedia.org/r/754063 (https://phabricator.wikimedia.org/T297411) [12:11:02] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33265/console" [puppet] - 10https://gerrit.wikimedia.org/r/754063 (https://phabricator.wikimedia.org/T297411) (owner: 10Jelto) [12:18:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [12:30:21] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:38:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [13:30:06] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:55:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:56:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:09:57] PROBLEM - SSH on restbase2010.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:11:15] RECOVERY - SSH on restbase2010.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:32:37] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:45:07] (03CR) 10Giuseppe Lavagetto: [C: 03+2] build: add mypy types [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/747104 (owner: 10Hashar) [16:46:38] (03Merged) 10jenkins-bot: build: add mypy types [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/747104 (owner: 10Hashar) [17:00:03] (03CR) 10Giuseppe Lavagetto: "I like the idea of throwing errors on undefined properties, but I would probably remove the changes to seed_image, as I plan to remove it " [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/747060 (https://phabricator.wikimedia.org/T297619) (owner: 10Hashar) [17:00:08] (03CR) 10Giuseppe Lavagetto: [C: 04-1] Be strict on undefined variables such as seed_image [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/747060 (https://phabricator.wikimedia.org/T297619) (owner: 10Hashar) [18:57:59] PROBLEM - Disk space on ml-etcd2002 is CRITICAL: DISK CRITICAL - free space: / 717 MB (3% inode=95%): /tmp 717 MB (3% inode=95%): /var/tmp 717 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ml-etcd2002&var-datasource=codfw+prometheus/ops [19:00:45] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:11:33] PROBLEM - SSH on restbase2011.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:03:25] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:12:51] RECOVERY - SSH on restbase2011.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:42:13] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:28:07] PROBLEM - SSH on ms-fe2008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:59:43] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 48.48 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1