[01:04:36] 10serviceops, 10SRE, 10Wikimedia-Apache-configuration: Investigate and restore K.A.Z httpbb test - https://phabricator.wikimedia.org/T289022 (10RLazarus)
[07:40:43] 10serviceops, 10SRE, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Jelto)
[07:51:46] 10serviceops, 10SRE, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn)
[07:53:36] 10serviceops, 10SRE, 10decommission-hardware, 10ops-eqiad, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Dzahn) 05Stalled→03Open
[08:37:48] 10serviceops, 10SRE, 10decommission-hardware, 10ops-eqiad, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jelto@cumin1001 for hosts: `mw[1276-...
[08:39:11] 10serviceops, 10SRE, 10decommission-hardware, 10ops-eqiad, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Jelto)
[09:18:25] 10serviceops, 10SRE, 10decommission-hardware, 10ops-eqiad, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Dzahn) a:05Dzahn→03Jelto Jelto, over to you, since you are removing the last 4 servers :) Just c...
[09:19:27] 10serviceops, 10SRE, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn)
[09:21:37] 10serviceops, 10SRE, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn) 05Open→03Resolved done !:) All new servers are finally in production now.
[09:24:42] 10serviceops, 10MW-on-K8s, 10SRE: Make HTTP calls work within mediawiki on kubernetes - https://phabricator.wikimedia.org/T288848 (10JMeybohm) I'd assume that MW makes HTTP calls to the public endpoints of MW. Those will be blocked in k8s as we generally prohibit egress traffic. I'm not sure this is the righ...
[10:43:06] jayme , effie : I'm struggling with what I think are connectivity issues with flink streaming updater on eqiad/codfw (staging works) - it seems that my Job Manager can't connect to task managers
[10:43:06] 12:23
[10:43:06] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/rdf-streaming-updater/ - helmfile with values
[10:43:18] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/flink-session-cluster/ - chart
[10:43:19] 12:24
[10:43:19] I'm not super familiar how to set up networking correctly, but it works on staging, so I assumed it should be ok on production as well, but that doesn't seem to be the case
[10:43:34] (copied from -operations channel, because of course I posted this on a wrong one)
[10:44:08] zpapierski: you have any logs/error messages?
[10:44:17] from Flink I do
[10:44:27] please share :)
[10:44:31] or link
[10:44:43] it usually goes something like this:
[10:44:46] https://www.irccloud.com/pastebin/Mwlvxu4p/
[10:44:56] it's basically timing out
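One way to sanity-check timeouts like the ones in that paste is to compare the addresses the JobManager is dialing against the pods that actually exist. A minimal sketch, assuming the rdf-streaming-updater namespace from the helmfile linked above; the pod name is a placeholder and the usual kubeconfig setup on the deployment host is taken for granted:

    # List the flink pods together with the pod IPs they currently hold
    kubectl -n rdf-streaming-updater get pods -o wide

    # Pull the addresses the JobManager is actually trying to reach out of its log
    # (<jobmanager-pod> is a placeholder for the running jobmanager pod name)
    kubectl -n rdf-streaming-updater logs <jobmanager-pod> | grep -iE 'connect|timed out' | tail -n 20

If the IPs in the log don't match any pod in the first listing, the JobManager is working from stale addresses rather than a real network problem.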
[10:52:56] will take a look. Do you have proof this works in staging (log messages for example)? Or do you assume because you don't see the error logs there?
[10:53:59] need a second to fine one - dcausse was working on it and said it worked, but I haven't been there by myself (he's on PTO rn)
[10:54:07] s/fine/find
[11:06:50] zpapierski: the logs you posted are from eqiad I suppose?
[11:07:05] codfw
[11:08:32] zpapierski: that's weird...the IPs in there are a) in eqiad range, b) not assigned to any pod
[11:08:48] zpapierski: you know how flink learns about them?
[11:08:51] ok, let me check, I might've confused envs
[11:11:12] I do see logs like that in eqiad's "flink-session-cluster-main-79975c64dd-88qhp"
[11:12:26] yeah, sorry for the confusion
[11:13:02] I'm still looking how to confirm that staging is running correctly
[11:13:08] np...
[11:14:33] I still struggle to understand what IPs flink chooses to connect to. The taskmanager pod in eqiad is running for 11d now and has 10.64.65.49, the logs say jobmanager tries 10.64.71.206 and 10.64.66.146 (both of them unused currently)
[11:14:56] so connection timeout is super fine from my POV :)
[11:17:09] I can confirm staging works (output topic is being populated)
[11:17:14] huh
[11:17:44] why is this happening?
[11:18:30] (I wish I could connect to flink ui on both eqiad and codfw)
[11:19:10] I've no idea what flink does to get the IPs of task managers, so I don't really know. I had assumed it just calls the kubernetes api but that does not seem to be the case (or it just uses stale data)
[11:19:38] hmm
[11:19:47] zpapierski: isn't that what you expose via nodeport tcp/4007?
[11:19:52] the UI I mean?
[11:20:04] that might explain why staging works - there were no changes in pod configuration
[11:21:24] I fear I'm unable to follow
[11:22:13] it is a UI port, but I couldn't get it to work in my browser and haven't had time to figure out why it doesn't work
[11:22:47] as for the issue - if flink is trying stale IPs, that could explain why staging works and eqiad/codfw isn't
[11:23:16] staging always had 1 job manager, 1 task manager, so there's a chance that IPs never changed
[11:23:26] that's clearly not the case now
[11:23:37] anyway, that's something for me to check, thanks :)
[11:27:21] zpapierski: as for the UI, something like "ssh -L 4007:kubernetes2001.codfw.wmnet:4007 deploy1002.eqiad.wmnet" and then "curl -k https://localhost:4007" works for me FWIW - although it's horribly slow
[11:28:10] I'm getting
[11:28:13] https://www.irccloud.com/pastebin/Cb7zgjRB/
[11:28:22] weird
[11:28:26] https://
[11:28:45] ah, no redirect
[11:28:56] there is no http port open
[11:29:03] true
[11:29:13] anyway, yep, wrong IPs
[11:30:04] I wonder if forced reboot would help
[11:30:27] hmm, it probably takes values from configmaps
[11:34:57] from what I see, those are just pointing to the (currently running) job manager
[11:35:04] kubectl -n rdf-streaming-updater get configmaps -l app=rdf-streaming-updater-eqiad-flink-cluster -o yaml | grep 10.64
[11:35:43] which means I might've been wrong about the whole thing
[11:36:01] it's task managers that register into job manager, after all
[11:48:22] not sure I understand the implications, but let me know if I can be of any help :)
[11:49:00] one thing - how can I restart or rebuild a pod?
[11:51:45] jayme: ^
[11:53:49] zpapierski: just added it to the docs, https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_restart
[11:54:06] great, thx!
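For reference, the generic kubectl equivalent of the rolling-restart procedure linked above looks roughly like this; the deployment and pod names are placeholders, and the wikitech page stays authoritative for how this is wrapped on the WMF deployment hosts:

    # Recreate all pods of a deployment, respecting the usual rolling-update settings
    kubectl -n rdf-streaming-updater rollout restart deployment <deployment-name>
    kubectl -n rdf-streaming-updater rollout status deployment <deployment-name>

    # Or, for a single suspect pod, delete it and let the ReplicaSet reschedule it
    kubectl -n rdf-streaming-updater delete pod <pod-name>

Either way the replacement pods get fresh IPs, which is relevant if stale addresses are the problem being chased here.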
[12:32:50] great
[12:42:26] mutante: for when you have time: https://gerrit.wikimedia.org/r/713454
[12:42:53] my knowledge of apache is not good enough to say if that definitely fixes it
[14:03:53] Amir1: have you tried it on mwdebug1001?
[14:04:54] oh sorry, that site is not there
[14:04:59] Yup
[14:05:02] 😄😄
[14:06:04] if you add a pcc, we can merge and roll back if it goes wrong
[14:17:14] 10serviceops, 10Push-Notification-Service, 10Product-Infrastructure-Team-Backlog (Kanban), 10User-jijiki: High latency on push notification service initialization - https://phabricator.wikimedia.org/T265258 (10MSantos) 05Open→03Resolved This issue hasn't happened recently. Please reopen in case I'm mis...
[14:26:53] effie: sure
[14:28:46] kostajh: are you around? I have a redis question for you
[14:38:57] effie: hi
[14:39:29] ohhi
[14:39:50] so, at this point, how much data are you storing in mediawiki's redis servers?
[14:41:04] effie: I don't know the answer to that. You mean the GrowthExperiments extension in particular, right?
[14:41:12] prolly yes
[14:41:55] effie: are there some capacity issues, is there some broader context for the question?
[14:42:18] yes there is, https://phabricator.wikimedia.org/T280582
[14:42:37] the tldr is that, we are now sharding 2-2.5GB of data across 18 servers
[14:42:46] and we would like to reduce this to 8
[14:43:04] so I was planning to start slowly removing one shard at a time
[14:43:54] the data of the shard I am taking offline will be lost
[14:44:22] effie: can you look at the data? I assume you are seeing the "GrowthExperiments-NewcomerTasks-TaskSet" component?
[14:44:38] I have not looked at the data
[14:46:18] effie: we plan to reduce the amount of data stored there, but we are also expanding to more wikis (including big ones like enwiki), so overall we'll probably see an increase in amount of data stored
[14:47:26] what is your timeline and how much more data are we talking about?
[14:47:57] I mean, what is your estimation of eg enwiki
[14:50:41] that is a closely guarded secret ... aka I'm not sure.
[14:51:24] enwiki in theory could happen in a month or two. Our features are activated for new user accounts, so it wouldn't be like a flood once it's enabled there
[14:52:22] anyway, if you'd like us to reduce the data we use, we have already talked informally in our team about ways to optimize the caching mechanism (ie store IDs instead of TaskSet objects), so we could do it but we should make a phab task to discuss further & schedule the work
[14:52:33] (stepping away for a while now)
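As a rough illustration of the sizing question above, one could check how much of a single shard is GrowthExperiments data with something like the following; the host, port and exact key naming are assumptions (and production Redis also requires auth), so treat this as a sketch rather than the actual procedure:

    # Total memory held by one shard
    redis-cli -h <shard-host> -p <shard-port> info memory | grep used_memory_human

    # Count keys that look like GrowthExperiments task-set cache entries
    redis-cli -h <shard-host> -p <shard-port> --scan --pattern '*NewcomerTasks-TaskSet*' | wc -l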
[15:15:29] effie: let me know when you feel you can deploy the change
[15:15:32] I added PCC
[15:32:23] sure, give me 2' to write something
[15:53:03] Amir1: go ahead and have a go
[15:53:55] Have a go at what? 🤪
[15:54:02] https://query.wikidata.org/querybuilder
[15:54:11] works on my pc (tm)
[15:54:23] Oooh nice
[15:54:28] does it work in yours?
[15:54:31] Can you merge it then?
[15:54:37] I merged it
[15:54:45] Oh nice
[15:54:57] check if it works for you too
[15:55:09] oh yes
[15:55:17] Thanks <3
[15:55:17] cool, this is sorted then
[15:55:23] awww
[15:57:58] 10serviceops, 10Maps, 10SRE-swift-storage: Tegola staging doesn't connect to swift - https://phabricator.wikimedia.org/T289076 (10Jgiannelos)
[15:59:02] effie: can I have a +2, please. - https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/713485
[15:59:14] sure, I am giving them away today
[15:59:19] yay
[15:59:37] can I have some, you know, if I need them, like, tomorrow?
[16:00:00] I will do it now, as soon as jenkins does its +1
[16:00:33] it even gave me +2
[16:01:31] btw - I got the job to work
[16:01:52] but thanks to issues in between I'm not sure if that's because of a single task manager, or restarting the cluster
[16:01:58] ah, what was the issue then ?
[16:02:06] ah yes, I meant its +2
[16:02:16] ¯\_(ツ)_/¯
[16:02:46] ah lol, I guess you will find out
[16:02:49] I'll now if it works on 3 of those soon enough
[16:02:56] s/now/know
[16:03:02] damn homophones...
[16:08:40] hahahaha
[16:10:28] 10serviceops, 10SRE: mcrouter crashing on mwmaint2002 - https://phabricator.wikimedia.org/T288787 (10jijiki) 05Open→03Resolved https://gerrit.wikimedia.org/r/712920 is merged, closing this.
[16:14:27] aand back to square on
[16:14:29] one
[16:14:36] so, 1 TM works, 3 it doesn't
[16:14:54] at least I can say that eqiad/codfw behaves the same way as staging
[16:15:52] the plot thickens
[16:16:08] is it still complaining about timeouts?
[16:17:51] no, those are gone
[16:18:35] 10serviceops, 10Maps, 10SRE-swift-storage: Tegola staging doesn't connect to swift - https://phabricator.wikimedia.org/T289076 (10Jgiannelos) According to grafana, swift connections are failing since Aug 13 https://grafana.wikimedia.org/goto/FjxPVpn7k
[16:20:58] so, again, Task Manager count is my only clue
[16:30:51] 10serviceops, 10SRE, 10Performance-Team (Radar), 10User-jijiki: Shrink redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10jijiki) @kostajh @krinkle, I would like to move this task forward. My plan is to remove one or two redis shard(s) per day, until we have 8 left. Right now the size of...
[16:45:58] 10serviceops, 10Patch-For-Review, 10User-jijiki: Productionise mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T278225 (10jijiki)
[16:47:15] 10serviceops, 10Maps, 10SRE-swift-storage, 10Patch-For-Review: Tegola staging doesn't connect to swift - https://phabricator.wikimedia.org/T289076 (10Jgiannelos) I guess its relevant to the bullseye upgrade: buster version: https://github.com/openstack/swift/blob/2.19.1/swift/common/middleware/s3api/s3api...
[17:12:35] 10serviceops, 10SRE, 10Patch-For-Review: Create a mediawiki::cronjob define - https://phabricator.wikimedia.org/T211250 (10RLazarus)
[17:24:39] Amir1: I see it was already merged by Effie :) thanks both
[17:25:29] mutante: if you feel like it, I have another one for adding a probe for it :D
[17:25:48] probe?
[17:26:10] ideally we should add that to httpbb tests
[17:26:19] https://gerrit.wikimedia.org/r/713497
[17:26:23] mutante: yup ^
[17:26:30] ok, great
[17:26:47] cool, that's what I wanted. and looks good..sec
[17:30:19] well.. this was unexpected
[17:30:28] from deploy or cumin ..can't connect to miscweb1002
[17:30:40] but I am positive this used to work and we had added the ferm holes for it
[17:30:52] need to look at that more
[17:31:34] Ariel's law (It never takes 5 minutes)
[17:31:50] good law
[17:32:36] ACCEPT tcp -- deploy1002.eqiad.wmnet anywhere tcp dpt:http
[17:32:40] um
[17:33:09] ERROR: HTTPSConnectionPool(host='miscweb1002.eqiad.wmnet', port=443)
[17:33:20] holes for http but now we use https by default?
[17:33:26] did that change ?
[17:34:25] no, it's something different. not even ferm..
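A minimal way to test the http-vs-https theory above from deploy1002 would be something like the following; the hostnames come from the log, everything else is assumed:

    # Port 80 is what the existing "ACCEPT ... dpt:http" rule covers
    curl -sv --max-time 5 -o /dev/null http://miscweb1002.eqiad.wmnet/

    # Port 443 is what the failing client (HTTPSConnectionPool) is actually using
    curl -skv --max-time 5 -o /dev/null https://miscweb1002.eqiad.wmnet/

    # On miscweb1002 itself, confirm which destination ports the ferm-generated rules open
    sudo iptables -L INPUT -n | grep -E 'dpt:(80|443)'

If only port 80 shows up in the rules while the client insists on 443, the fix would be widening the ferm hole rather than changing the test.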
[17:37:30] yea, I'll look at it tomorrow, it's 7.30 and I don't even see what the issue is, I can connect with curl
[18:08:16] 10serviceops, 10SRE-swift-storage, 10envoy: Envoy and swift HEAD with 204 response turns into 503 - https://phabricator.wikimedia.org/T288815 (10RLazarus) a:03RLazarus
[19:04:49] 10serviceops, 10SRE, 10Patch-For-Review: Create a mediawiki::cronjob define - https://phabricator.wikimedia.org/T211250 (10RLazarus) 05Open→03Resolved a:03RLazarus 🎉 With the dispatcher jobs migrated to systemd timers today, this is done! There are no maintenance cronjobs left. ` rzl@mwmaint2002:~$ s...
[19:12:33] 10serviceops, 10SRE, 10SRE-tools, 10Spicerack, 10Datacenter-Switchover: Clean up cron-specific elements of switchdc cookbooks - https://phabricator.wikimedia.org/T289078 (10RLazarus)
[19:20:19] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10SRE-tools, and 2 others: Clean up cron-specific elements of switchdc cookbooks - https://phabricator.wikimedia.org/T289078 (10RLazarus)
[21:15:51] 10serviceops, 10SRE, 10Performance-Team (Radar), 10User-jijiki: Shrink redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10Krinkle) Good to go from both of us. Last time we did maintenance (T252391) it was realized that the instrumentation that relies on the stronger persistence was no lo...