[00:00:05] RoanKattouw and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211110T0000).
[00:00:05] Juan_90264 and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:47] Hello, I'm present
[00:01:32] jouncebot: now
[00:01:32] For the next 0 hour(s) and 58 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211110T0000)
[00:02:14] present
[00:04:47] RoanKattouw: ?
[00:05:19] Urbanecm: ?
[00:06:55] @thcipriani are you around?
[00:07:12] Where are the deployers? They are never available at this time
[00:09:42] brennen is anyone in RelEng able to run the backport window today?
[00:11:11] (CR) Cwhite: statistics::product_analytics: Update contact group for monitoring (1 comment) [puppet] - https://gerrit.wikimedia.org/r/736916 (https://phabricator.wikimedia.org/T295381) (owner: Bearloga)
[00:13:11] Krinkle: ?
[00:13:18] Mmm, it does seem like there's been difficulty finding deployers for this window
[00:14:11] ^
[00:14:14] I was just looking at that
[00:14:38] Hello thcipriani
[00:14:41] the least popular window over the past 5 years, too, I now realize
[00:16:28] well I'll backport
[00:18:11] (PS4) Thcipriani: Add enwikibooks in wgImportSources to bnwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/737081 (https://phabricator.wikimedia.org/T295051) (owner: Juan90264)
[00:18:28] Okay
[00:18:51] (CR) Thcipriani: [C: +2] "CONFIG" [mediawiki-config] - https://gerrit.wikimedia.org/r/737081 (https://phabricator.wikimedia.org/T295051) (owner: Juan90264)
[00:19:33] thcipriani: Perhaps this is something we could bring up in "Monthly engineering leaders meeting" on Dec 1st
[00:19:40] and find some Tuesday deployers from WMF
[00:19:42] (Merged) jenkins-bot: Add enwikibooks in wgImportSources to bnwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/737081 (https://phabricator.wikimedia.org/T295051) (owner: Juan90264)
[00:20:21] Most people are doing it basically on a volunteer basis
[00:20:33] And releng are training people who want to learn how to do it
[00:20:45] I had a short convo with urbane.cm about shifting some windows around. They're tilted a little late as it is.
[00:21:18] (CR) Cwhite: Add the first eventgate alert to Alertmanager (1 comment) [alerts] - https://gerrit.wikimedia.org/r/736490 (https://phabricator.wikimedia.org/T293399) (owner: Btullis)
[00:21:20] Great merged
[00:21:29] Juan_90264: live on mwdebug1002, check please
[00:22:05] Okay
[00:22:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:42] (PS3) Thcipriani: Add mobile wordmark for foundation-wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/737773 (https://phabricator.wikimedia.org/T295303) (owner: Jdrewniak)
[00:24:25] I think a "late" window is important, but it might be a bit too late right now, given that UTC late is now actually UTC very early because DST
[00:25:31] yeah, shifting it by a couple of hours may make a big difference for deployer availability is my current thinking
[00:26:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:17] thcipriani: that sounds like a good idea
[00:26:23] I think available deployers dwindling in this window might actually be a sign that we're a little less US West Coast centric (he said optimistically)
[00:26:32] TBH i would have used an earlier window if available
[00:26:44] i'm not a big fan of 4pm backports so try 11am PST wherever possible :-)
[00:27:00] thcipriani: I approved
[00:27:07] Juan_90264: thanks, syncing
[00:27:40] (also, UTC morning is now in the afternoon. isn't DST great?)
[00:28:42] (this is because the deployment calendar is anchored to SF time)
[00:28:46] when UTC collides with real world schedules: it's tough :)
[00:28:58] Wouldn't they be able to change this window to 22:00 UTC?
If I'm not mistaken, at this time Urbanec is still available
[00:29:02] !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:737081|Add enwikibooks in wgImportSources to bnwikibooks (T295051)]] (duration: 00m 56s)
[00:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:29:06] T295051: Set "enwikibooks" as bnwikibooks import target - https://phabricator.wikimedia.org/T295051
[00:29:13] ^ Juan_90264 live now
[00:29:20] Perfect
[00:29:28] Ideally we wouldn't depend on just Martin for this too
[00:29:33] (CR) Thcipriani: [C: +2] Add mobile wordmark for foundation-wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/737773 (https://phabricator.wikimedia.org/T295303) (owner: Jdrewniak)
[00:29:53] he's a bit busy with all of his hats :)
[00:30:26] (Merged) jenkins-bot: Add mobile wordmark for foundation-wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/737773 (https://phabricator.wikimedia.org/T295303) (owner: Jdrewniak)
[00:30:36] @AntiComposite: Talk about placing at 22:00 UTC?
[00:30:41] we have gotten some takers for deployment training https://wikitech.wikimedia.org/wiki/Deployments/Training
[00:31:09] it's possible we could add some names to our roster (he added optimistically)
[00:31:25] Jdlrobson: first one on mwdebug1002, check please
[00:32:05] if I actually make a sticker from the picture on the page, maybe I could lure others
[00:32:06] thcipriani: checking
[00:32:45] (PS2) Thcipriani: Add mobile logo and wordmark for metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/737771 (https://phabricator.wikimedia.org/T295303) (owner: Jdrewniak)
[00:33:14] RECOVERY - Disk space on webperf1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=webperf1002&var-datasource=eqiad+prometheus/ops
[00:33:41] They're finally putting a wordmark for metawiki, huh
[00:34:27] thcipriani: first one = Add mobile wordmark for foundation-wiki ?
[00:34:38] yep
[00:34:40] hmm
[00:34:54] foundationwiki = https://foundation.wikimedia.org/wiki/Home
[00:34:56] right?
[00:35:21] ya
[00:35:25] :D seems true -- are you on mwdebug1002?
[00:35:33] yeh... but not seeing the logo
[00:36:03] the patch isn't right
[00:36:06] 'foundationwiki' => '/static/images/mobile/copyright/wikimedia.svg',
[00:36:11] but
[00:36:13] static/images/mobile/copyright/wikimedia-wordmark.svg was added
[00:36:14] RECOVERY - Disk space on webperf2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=webperf2002&var-datasource=codfw+prometheus/ops
[00:36:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:39] sure, /static/images/mobile/copyright/wikimedia.svg already exists...
[00:36:43] but something looks a bit odd
[00:37:20] there's 2 logos
[00:37:30] @Jdl
[00:37:39] the patch looks fine to me
[00:37:40] Jdlrobson True, see: https://foundation.wikimedia.org/static/images/mobile/copyright/wikimedia.svg with mwdebug1002
[00:37:56] It should be kicking in on https://foundation.wikimedia.org/wiki/Home?useskinversion=2
[00:38:09] with the wikimedia.svg on the left and the wordmark on the right
[00:38:44] Jdlrobson: try now, I forgot to rebase :\
[00:38:47] ahhh
[00:38:48] now
[00:39:05] there we go
[00:39:08] that'll do it
[00:39:19] ^ the joys of our fiddly deployments :D
[00:39:28] it needs a bit of tweaking as that's huge on mobile, but that should be good to sync (it's better than the status quo)
[00:39:29] I swear I've done this before
[00:39:33] I'll do a follow up to fix the sizing
[00:39:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:52] (also wow the foundation main page does not look good on mobile)
[00:40:13] * thcipriani syncing
[00:40:35] I'm looking forward to how it will look on metawiki
[00:40:55] And I liked the foundationwiki wordmark
[00:41:01] !log thcipriani@deploy1002 Synchronized static/images/mobile/copyright/wikimedia-wordmark.svg: Config: [[gerrit:737773|Add mobile wordmark for foundation-wiki (T295303)]] (duration: 00m 56s)
[00:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:05] T295303: Project names are too long for mobile - https://phabricator.wikimedia.org/T295303
[00:41:09] (CR) Thcipriani: [C: +2] Add mobile logo and wordmark for metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/737771 (https://phabricator.wikimedia.org/T295303) (owner: Jdrewniak)
[00:42:15] AntiComposite: it sure could use some TemplateStyles love
[00:42:31] (Merged) jenkins-bot: Add mobile logo and wordmark
for metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/737771 (https://phabricator.wikimedia.org/T295303) (owner: Jdrewniak)
[00:42:33] How about changing the time of this window to 22:00 UTC?
[00:42:34] !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:737773|Add mobile wordmark for foundation-wiki (T295303)]] (duration: 00m 55s)
[00:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:07] (PS2) Thcipriani: Set sampling rate for mobile click tracking to 100% on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/737794 (https://phabricator.wikimedia.org/T294738) (owner: Jdlrobson)
[00:43:27] Jdlrobson: the one for meta should be live on mwdebug1002 now, check please
[00:43:32] (rebase for sure :))
[00:43:38] How about changing the time of this window to 22:00 UTC?
[00:43:55] seems like a good idea :)
[00:44:16] Juan_90264: we should just get you deployer access :)
[00:44:26] ^
[00:44:34] (PS1) Jdlrobson: Scale down the foundation wiki logo [mediawiki-config] - https://gerrit.wikimedia.org/r/737808 (https://phabricator.wikimedia.org/T295303)
[00:44:36] thcipriani: on it
[00:44:40] <3
[00:44:55] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/737808 thcipriani i'm going to need this follow up. I'll move to Deployments calendar after testing
[00:45:19] meta wiki is beautiful
[00:45:22] that can be synced
[00:45:41] bd808: I liked the idea, but I try to think about it later
[00:46:18] Pretty much metawiki, finally
[00:46:48] added https://gerrit.wikimedia.org/r/c/737808/ to deployment calendar.
[00:46:56] https://gerrit.wikimedia.org/r/c/737794/ is a beta cluster only patch so hopefully not as eventful
[00:47:20] Jdlrobson: you're uhh checking these locally before I'm syncing them, right? Why did you catch the too big logo only after it sync'd?
[00:47:38] !log thcipriani@deploy1002 Synchronized static/images/mobile/copyright/: Config: [[gerrit:737771|Add mobile logo and wordmark for metawiki (T295303)]] (duration: 00m 56s)
[00:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:41] T295303: Project names are too long for mobile - https://phabricator.wikimedia.org/T295303
[00:47:57] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[00:48:46] !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:737771|Add mobile logo and wordmark for metawiki (T295303)]] (duration: 00m 55s)
[00:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:52] I'll get 737808 out, but it feels a lot like debugging in prod which is no fun when you're the deployer :(
[00:49:13] (CR) Thcipriani: [C: +2] Set sampling rate for mobile click tracking to 100% on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/737794 (https://phabricator.wikimedia.org/T294738) (owner: Jdlrobson)
[00:49:39] (PS2) Thcipriani: Scale down the foundation wiki logo [mediawiki-config] - https://gerrit.wikimedia.org/r/737808 (https://phabricator.wikimedia.org/T295303) (owner: Jdlrobson)
[00:49:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:55] (Merged) jenkins-bot: Set sampling rate for mobile click tracking to 100% on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/737794 (https://phabricator.wikimedia.org/T294738) (owner: Jdlrobson)
[00:49:57] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[00:50:20] (CR) Thcipriani: [C: +2] Scale down the foundation wiki logo [mediawiki-config] - https://gerrit.wikimedia.org/r/737808 (https://phabricator.wikimedia.org/T295303) (owner: Jdlrobson)
[00:50:38] thcipriani: yeh i'm checking them locally. I just figured it was better than the status quo even in the squashed form. Was that not the right way to do it? Should I have pushed the patch and deployed them together?
[00:51:31] (Merged) jenkins-bot: Scale down the foundation wiki logo [mediawiki-config] - https://gerrit.wikimedia.org/r/737808 (https://phabricator.wikimedia.org/T295303) (owner: Jdlrobson)
[00:51:45] (previously the logo looked like this https://phab.wmfusercontent.org/file/data/mxz6s632vo4u5hk6uph5/PHID-FILE-qblhnfhcgaqzn5ncqfor/Screen_Shot_2021-11-08_at_9.45.10_AM.png (but with different text))
[00:52:50] live on mwdebug1002
[00:52:53] testing
[00:53:03] perfect
[00:53:06] * thcipriani syncs
[00:53:07] desktop and mobile looking great
[00:53:32] Now the logo is bigger than the wordmark
[00:53:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:54:34] !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:737808|Scale down the foundation wiki logo (T295303)]] (duration: 00m 56s)
[00:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:54:37] T295303: Project names are too long for mobile - https://phabricator.wikimedia.org/T295303
[00:55:24] Juan_90264: that should be fine, but that's why we have design review.
I'll note that to our designer in case he wants to tweak it later
[00:56:07] Jdlrobson: Okay
[00:56:11] https://bg.wikipedia.org/ for example
[00:57:09] Thanks thcipriani for all your help here
[00:57:20] sure thing
[00:57:28] and thanks Juan_90264 for the extra testing :) and Reedy for helping me debug the logo issue
[00:57:45] ftr I get why you'd want to do it. I don't care if it's two patches. I asked because I deployed, you said: ~"better" then you made a patch to scale it down immediately. Which feels like a very slow and costly debug loop.
[00:57:58] Jdlrobson: (y)
[01:03:34] RECOVERY - Maps tiles generation on alert1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[01:03:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[01:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:06:52] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.49 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:07:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[01:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:08:12] (PS1) Jdlrobson: Lower mobile web click tracking rate [mediawiki-config] - https://gerrit.wikimedia.org/r/737814 (https://phabricator.wikimedia.org/T295432)
[01:16:14] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[01:18:14] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[03:11:24] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:40:46] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:42:10] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:58:46] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:59:26] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 237, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:15:49] !log T283606: running foreachwikiindblist growthexperiments extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --search-index
[04:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:15:53] T283606: Add a link: too many articles have no suggestions upon arrival - https://phabricator.wikimedia.org/T283606
[04:26:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:31:59] (PS3) Andrew Bogott: hieradata: add cloud-cumin04 [puppet] - https://gerrit.wikimedia.org/r/737709 (owner: Majavah)
[04:33:35]
(CR) Andrew Bogott: [C: +2] hieradata: add cloud-cumin04 [puppet] - https://gerrit.wikimedia.org/r/737709 (owner: Majavah)
[04:45:28] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:46:07] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:53:34] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[05:09:20] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 431 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:13:36] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 20 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:29:57] (PS3) Juan90264: Update $wgNamespacesToBeSearchedDefault for Wikimania 2022 [mediawiki-config] - https://gerrit.wikimedia.org/r/737695 (https://phabricator.wikimedia.org/T295267)
[05:39:16] SRE, DBA, cloud-services-team (Kanban): db1112 (s3 contribs/rc replica) is down - https://phabricator.wikimedia.org/T294295 (Marostegui) @Cmjohnson yeah, we'd need to be scheduled as we need to stop mysql first. Let us know which day would work for you!
Thank you
[05:42:03] (PS1) Marostegui: mariadb: Promote db1109 as s8 master [puppet] - https://gerrit.wikimedia.org/r/737831 (https://phabricator.wikimedia.org/T294321)
[05:42:26] (CR) Marostegui: [C: -2] "Wait for failover day" [puppet] - https://gerrit.wikimedia.org/r/737831 (https://phabricator.wikimedia.org/T294321) (owner: Marostegui)
[05:45:43] (PS1) Marostegui: wmnet: Update s8-master [dns] - https://gerrit.wikimedia.org/r/737832 (https://phabricator.wikimedia.org/T294321)
[05:46:10] (CR) Marostegui: [C: -2] "Wait for the failover day" [dns] - https://gerrit.wikimedia.org/r/737832 (https://phabricator.wikimedia.org/T294321) (owner: Marostegui)
[05:46:33] (PS3) Juan90264: Enable the visual editor on the 2022 namespace on Wikimania wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/737696 (https://phabricator.wikimedia.org/T295267)
[06:18:12] RECOVERY - snapshot of s3 in codfw on alert1001 is OK: Last snapshot for s3 at codfw (db2139.codfw.wmnet:3313) taken on 2021-11-10 03:00:30 (1142 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[06:18:32] SRE, ops-eqiad, DBA, DC-Ops: db1112 - DIMM replacement - https://phabricator.wikimedia.org/T294345 (Marostegui) The DIMM has arrived, we are coordinating a date to power off the host at T294295
[06:19:06] SRE, DBA, cloud-services-team (Kanban): db1112 (s3 contribs/rc replica) is down - https://phabricator.wikimedia.org/T294295 (Marostegui) Stalled→Open
[06:29:55] (PS10) Ideophagous: Bug:T291737 Squashed two commits into one, previous commit comments follow: Bug:T291737 Change-Id: Ib263a5419c6ace911a597d025b28d6ef13549c10 [mediawiki-config] - https://gerrit.wikimedia.org/r/735713
[06:34:26] (CR) Marostegui: [C: +1] "So apart from changing the doc once submitted, we probably want to send an email to those who receive pages to let them know about this ne" [puppet] - https://gerrit.wikimedia.org/r/736415
(https://phabricator.wikimedia.org/T233684) (owner: Kormat)
[06:41:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1109 with weight 0 T294321', diff saved to https://phabricator.wikimedia.org/P17715 and previous config saved to /var/cache/conftool/dbconfig/20211110-064120-root.json
[06:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:24] T294321: Switchover s8 from db1104 to db1109 - https://phabricator.wikimedia.org/T294321
[06:42:33] (CR) Giuseppe Lavagetto: Add apple-search deployment (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/736273 (https://phabricator.wikimedia.org/T289224) (owner: Giuseppe Lavagetto)
[06:53:57] (PS12) Giuseppe Lavagetto: mediawiki::php: support multiple php version in monitoring too [puppet] - https://gerrit.wikimedia.org/r/736949 (https://phabricator.wikimedia.org/T293450)
[06:53:59] (PS8) Giuseppe Lavagetto: mediawiki: add support for multiple versions in the web configuration [puppet] - https://gerrit.wikimedia.org/r/737330 (https://phabricator.wikimedia.org/T293450)
[06:54:01] (PS1) Giuseppe Lavagetto: php: remove backwards compatibility layer [puppet] - https://gerrit.wikimedia.org/r/737836
[06:57:07] (CR) jerkins-bot: [V: -1] php: remove backwards compatibility layer [puppet] - https://gerrit.wikimedia.org/r/737836 (owner: Giuseppe Lavagetto)
[06:57:16] (CR) jerkins-bot: [V: -1] mediawiki: add support for multiple versions in the web configuration [puppet] - https://gerrit.wikimedia.org/r/737330 (https://phabricator.wikimedia.org/T293450) (owner: Giuseppe Lavagetto)
[07:01:06] (PS2) Giuseppe Lavagetto: php: remove backwards compatibility layer [puppet] - https://gerrit.wikimedia.org/r/737836
[07:01:08] (PS13) Giuseppe Lavagetto: mediawiki::php: support multiple php version in monitoring too [puppet] - https://gerrit.wikimedia.org/r/736949 (https://phabricator.wikimedia.org/T293450)
[07:01:10] (PS9)
Giuseppe Lavagetto: mediawiki: add support for multiple versions in the web configuration [puppet] - https://gerrit.wikimedia.org/r/737330 (https://phabricator.wikimedia.org/T293450)
[07:02:52] (CR) jerkins-bot: [V: -1] php: remove backwards compatibility layer [puppet] - https://gerrit.wikimedia.org/r/737836 (owner: Giuseppe Lavagetto)
[07:03:13] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32313/console" [puppet] - https://gerrit.wikimedia.org/r/737836 (owner: Giuseppe Lavagetto)
[07:03:53] (CR) jerkins-bot: [V: -1] mediawiki: add support for multiple versions in the web configuration [puppet] - https://gerrit.wikimedia.org/r/737330 (https://phabricator.wikimedia.org/T293450) (owner: Giuseppe Lavagetto)
[07:07:08] (Abandoned) Elukey: profile::presto::server: use a stricter truststore [puppet] - https://gerrit.wikimedia.org/r/737065 (owner: Elukey)
[07:11:43] (PS2) Elukey: Reduce verbosity of the log commit message [cookbooks] - https://gerrit.wikimedia.org/r/737706
[07:13:31] (CR) Elukey: "@Razzi: tried to follow up and renamed the self.reason attribute to self.admin_reason, so now the string should be retrieved via self.admi" [cookbooks] - https://gerrit.wikimedia.org/r/737706 (owner: Elukey)
[07:15:42] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:17:21] (PS1) Marostegui: wmnet: Replace m5-master with dbproxy1017 [dns] - https://gerrit.wikimedia.org/r/737837 (https://phabricator.wikimedia.org/T288093)
[07:17:34] (CR) Marostegui: [C: -2] "Wait for the failover hour" [dns] - https://gerrit.wikimedia.org/r/737837 (https://phabricator.wikimedia.org/T288093) (owner: Marostegui)
[07:26:21] (CR) Elukey: "Some nits and comments, but it looks good! Thanks!"
[puppet] - https://gerrit.wikimedia.org/r/737756 (owner: Jbond)
[07:33:23] SRE, vm-requests: : of VMs requested for - https://phabricator.wikimedia.org/T295436 (Wiphawrrnb63)
[07:41:34] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[07:49:04] (PS4) ArielGlenn: add credentials file for downloading enterprise html dumps [puppet] - https://gerrit.wikimedia.org/r/736461 (https://phabricator.wikimedia.org/T273585)
[07:50:04] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[07:52:18] (CR) ArielGlenn: [C: +2] add credentials file for downloading enterprise html dumps [puppet] - https://gerrit.wikimedia.org/r/736461 (https://phabricator.wikimedia.org/T273585) (owner: ArielGlenn)
[07:52:34] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:53:32] andrewbogott: can I merge your "hieradata: add cloud-cumin04" change?
[07:54:02] (on puppetmaster1001)
[07:54:42] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:54:43] apergos: andrew is likely asleep and it was merged a few hours ago, but it's my patch and it should be safe to merge
[07:54:53] ok, doing so
[07:56:02] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge.
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[08:14:20] (CR) Volans: [C: +2] sre.hosts.dhcp: add new cookbook to setup DHCP [cookbooks] - https://gerrit.wikimedia.org/r/737753 (owner: Volans)
[08:18:35] (Merged) jenkins-bot: sre.hosts.dhcp: add new cookbook to setup DHCP [cookbooks] - https://gerrit.wikimedia.org/r/737753 (owner: Volans)
[08:22:26] !log volans@cumin1001 START - Cookbook sre.hosts.dhcp for host ganeti6004.drmrs.wmnet
[08:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:04] (PS5) Arturo Borrero Gonzalez: wmcs: add openstack network tests cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/737667 (https://phabricator.wikimedia.org/T294955)
[08:29:27] (CR) jerkins-bot: [V: -1] wmcs: add openstack network tests cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/737667 (https://phabricator.wikimedia.org/T294955) (owner: Arturo Borrero Gonzalez)
[08:32:47] (PS6) Arturo Borrero Gonzalez: wmcs: add openstack network tests cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/737667 (https://phabricator.wikimedia.org/T294955)
[08:33:51] (PS1) Jelto: aptrepo::files::updates Update gitlab-ce and gitlab-runner to 14.4 [puppet] - https://gerrit.wikimedia.org/r/737847 (https://phabricator.wikimedia.org/T294580)
[08:39:09] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host ganeti6004.drmrs.wmnet
[08:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:12] !log volans@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti6004.drmrs.wmnet with OS buster
[08:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:22] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by
volans@cumin1001 for host ganeti6004.drmrs.wmnet with OS buster
[08:44:00] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:52:53] <_joe_> anyone else having issues with gerrit?
[08:53:05] (PS3) Giuseppe Lavagetto: php: remove backwards compatibility layer [puppet] - https://gerrit.wikimedia.org/r/737836
[08:53:07] (PS14) Giuseppe Lavagetto: mediawiki::php: support multiple php version in monitoring too [puppet] - https://gerrit.wikimedia.org/r/736949 (https://phabricator.wikimedia.org/T293450)
[08:53:09] (PS10) Giuseppe Lavagetto: mediawiki: add support for multiple versions in the web configuration [puppet] - https://gerrit.wikimedia.org/r/737330 (https://phabricator.wikimedia.org/T293450)
[08:53:14] <_joe_> heh transient issues I'd say
[08:54:34] (PS1) ArielGlenn: [WIP] add enterprise html dumps downloader settings file and systemd timer [puppet] - https://gerrit.wikimedia.org/r/737854 (https://phabricator.wikimedia.org/T273585)
[08:55:23] (CR) jerkins-bot: [V: -1] [WIP] add enterprise html dumps downloader settings file and systemd timer [puppet] - https://gerrit.wikimedia.org/r/737854 (https://phabricator.wikimedia.org/T273585) (owner: ArielGlenn)
[08:58:17] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 11): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32315/console" [puppet] - https://gerrit.wikimedia.org/r/737836 (owner: Giuseppe Lavagetto)
[09:01:27] (PS2) ArielGlenn: [WIP] add enterprise html dumps downloader settings file and systemd timer [puppet] - https://gerrit.wikimedia.org/r/737854 (https://phabricator.wikimedia.org/T273585)
[09:14:41] (CR) Elukey: "Forgot to mention - we can drop the bundle that I added to puppet sslcert's file directory since it is not needed anymore."
[puppet] - 10https://gerrit.wikimedia.org/r/737756 (owner: 10Jbond) [09:19:20] (03PS7) 10Arturo Borrero Gonzalez: wmcs: add openstack network tests cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/737667 (https://phabricator.wikimedia.org/T294955) [09:22:05] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti6004.drmrs.wmnet with OS buster [09:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:14] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by volans@cumin1001 for host ganeti6004.drmrs.wmnet with OS buster completed: - ganeti6004 (**WA... [09:29:38] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:31:02] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:35:17] (03PS8) 10Jbond: P:base::certificates: refactor jks trust-store [puppet] - 10https://gerrit.wikimedia.org/r/737756 [09:38:39] (03PS9) 10Jbond: P:base::certificates: refactor jks trust-store [puppet] - 10https://gerrit.wikimedia.org/r/737756 [09:38:58] !log Upgrade db1124, db1125, db1133 and pc2014 to mariadb 10.4.22 [09:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:30] (03PS8) 10Arturo Borrero Gonzalez: wmcs: add openstack network tests cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/737667 (https://phabricator.wikimedia.org/T294955) [09:41:00] (03CR) 10David Caro: "One question, the rest are just nits (feel free to ignore)" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/737667 (https://phabricator.wikimedia.org/T294955) (owner: 10Arturo Borrero Gonzalez) [09:42:23] (03PS9) 10Arturo Borrero Gonzalez: wmcs: add openstack network tests cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/737667 (https://phabricator.wikimedia.org/T294955) [09:42:42] (03PS1) 10Majavah: aptrepo: add component for rackspace openstack debs [puppet] - 10https://gerrit.wikimedia.org/r/737856 (https://phabricator.wikimedia.org/T295234) [09:46:16] (03PS1) 10Thiemo Kreuz (WMDE): Streamline/modernize code in MWConfigCacheGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737857 [09:47:27] (03PS1) 10Thiemo Kreuz (WMDE): Remove unused code from StaticSiteConfiguration class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737858 [09:49:24] (03PS10) 10Jbond: P:base::certificates: refactor jks trust-store [puppet] - 10https://gerrit.wikimedia.org/r/737756 [09:49:59] (03PS1) 10Thiemo Kreuz (WMDE): Use more compact PHP7 syntax where possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 [09:50:01] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32317/console" [puppet] - 10https://gerrit.wikimedia.org/r/737756 (owner: 10Jbond) [09:51:01] (03CR) 10jerkins-bot: [V: 04-1] Use more compact PHP7 syntax where possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 (owner: 10Thiemo Kreuz (WMDE)) [09:52:30] (03CR) 10Jbond: [C: 03+1] "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/737756 (owner: 10Jbond) [09:52:33] (03PS1) 10David Caro: ceph::client::rbd_glance: remove keyring generation [puppet] - 10https://gerrit.wikimedia.org/r/737860 (https://phabricator.wikimedia.org/T293752) [09:54:13] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] "pcc says this change is a noop everywhere." [puppet] - 10https://gerrit.wikimedia.org/r/737836 (owner: 10Giuseppe Lavagetto) [09:56:51] (03CR) 10Elukey: [C: 03+1] "LGTM, let's give it a try :)" [puppet] - 10https://gerrit.wikimedia.org/r/737756 (owner: 10Jbond) [09:58:05] (03CR) 10Elukey: [C: 03+1] "We will probably need to run something like:" [puppet] - 10https://gerrit.wikimedia.org/r/737756 (owner: 10Jbond) [09:59:34] 10SRE, 10Trash: this is a test subtask - https://phabricator.wikimedia.org/T120033 (10Aklapper) 05Resolved→03Invalid [09:59:40] 10SRE, 10Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176 (10Aklapper) [10:02:12] (03PS6) 10Jbond: P:openstack::base::cloudgw: drop unneeded profiles [puppet] - 10https://gerrit.wikimedia.org/r/737774 [10:04:08] (03PS2) 10Vgutierrez: varnish: Mimick XFF behaviour with UDS + PROXY protocol [puppet] - 10https://gerrit.wikimedia.org/r/737755 (https://phabricator.wikimedia.org/T290005) [10:04:59] (03PS7) 10Jbond: P:openstack::base::cloudgw: drop unneeded profiles [puppet] - 10https://gerrit.wikimedia.org/r/737774 [10:09:01] (03PS3) 10Vgutierrez: varnish: Mimick XFF behaviour with UDS + PROXY protocol [puppet] - 10https://gerrit.wikimedia.org/r/737755 
(https://phabricator.wikimedia.org/T290005) [10:10:19] (03CR) 10Elukey: [V: 03+1 C: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32319/console" [puppet] - 10https://gerrit.wikimedia.org/r/737756 (owner: 10Jbond) [10:12:07] (03PS1) 10Jbond: admin: drop jmixter user [puppet] - 10https://gerrit.wikimedia.org/r/737864 [10:12:51] (03CR) 10jerkins-bot: [V: 04-1] admin: drop jmixter user [puppet] - 10https://gerrit.wikimedia.org/r/737864 (owner: 10Jbond) [10:13:18] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:base::certificates: refactor jks trust-store (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/737756 (owner: 10Jbond) [10:13:24] elukey: FYI merging now ^^ [10:13:50] <_joe_> jbond_: merge mine as well [10:14:04] <_joe_> i was about to ask you if i could merge your change :) [10:14:18] (03PS4) 10Btullis: Update the times at which refine_sanitize monitor jobs are run [puppet] - 10https://gerrit.wikimedia.org/r/737650 [10:14:22] _joe_: says it's still locked by you [10:14:35] nevermind merging now [10:14:38] <_joe_> jbond_: heh freed [10:14:52] merged [10:15:59] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32321/console" [puppet] - 10https://gerrit.wikimedia.org/r/737650 (owner: 10Btullis) [10:17:10] jbond_: ack thanks! [10:17:16] going to apply it to 1006 and check [10:17:17] <_joe_> jbond_: I see the puppet ca crt has been redefined? [10:17:21] (03PS2) 10David Caro: ceph::client::rbd_glance: remove keyring generation [puppet] - 10https://gerrit.wikimedia.org/r/737860 (https://phabricator.wikimedia.org/T293752) [10:17:22] <_joe_> is that normal? 
[10:17:23] (03PS1) 10David Caro: p:ceph::client::rbd_cloudcontrol: remove keyring generation [puppet] - 10https://gerrit.wikimedia.org/r/737869 (https://phabricator.wikimedia.org/T293752) [10:17:26] <_joe_> on a random appserver [10:17:43] _joe_: just in /etc/ssl/localcerts/Puppet_Internal_CA.crt right? [10:17:48] <_joe_> yep [10:17:56] yes it's just a copy to do a jks hack [10:18:00] https://gerrit.wikimedia.org/r/c/operations/puppet/+/737756 [10:18:03] it is not used yet, we are using it for the truststores etc.. [10:18:11] <_joe_> ack [10:18:50] elukey: fyi I'm also going to delete the old jks certificate as it currently has the puppet ca twice [10:20:08] (03CR) 10Alexandros Kosiaris: [C: 04-1] mediawiki::php: support multiple php version in monitoring too (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/736949 (https://phabricator.wikimedia.org/T293450) (owner: 10Giuseppe Lavagetto) [10:20:30] jbond_: <3 [10:20:40] elukey: /etc/ssl/localcerts/trusted_root_ca.jks looks good to me now (on kafka-test1006) [10:20:40] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.03725 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:21:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:21:33] * jbond_ looking at puppet failures [10:21:44] lmk if you need a hand [10:21:55] jbond_: Could not find command '/usr/bin/cat' [10:22:05] (03CR) 10Lucas Werkmeister (WMDE): "Hm, this replaces the 2021 namespace instead of adding to it… maybe we should leave both to be searched for now, and remove the 2021 one o" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737695 (https://phabricator.wikimedia.org/T295267) (owner: 10Juan90264) [10:22:11] ahh thanks will fix now [10:23:54] (03PS1) 10Jbond: sslcer: use /bin/cat not /usr/bin/cat [puppet] - 
10https://gerrit.wikimedia.org/r/737872 [10:24:19] (03CR) 10Jbond: [V: 03+2 C: 03+2] sslcer: use /bin/cat not /usr/bin/cat [puppet] - 10https://gerrit.wikimedia.org/r/737872 (owner: 10Jbond) [10:26:05] looks good now! [10:26:17] cool [10:26:28] just running with cumin to clear up the failed ones [10:26:38] k [10:28:44] (03PS5) 10Btullis: Update the times at which refine_sanitize monitor jobs are run [puppet] - 10https://gerrit.wikimedia.org/r/737650 [10:30:42] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005643 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:31:18] (03PS1) 10Elukey: knative-serving: add ingress network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/737875 (https://phabricator.wikimedia.org/T289834) [10:34:09] (03PS2) 10Elukey: knative-serving: add ingress network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/737875 (https://phabricator.wikimedia.org/T289834) [10:35:57] (03CR) 10Kormat: [C: 03+1] mariadb: Promote db1109 as s8 master [puppet] - 10https://gerrit.wikimedia.org/r/737831 (https://phabricator.wikimedia.org/T294321) (owner: 10Marostegui) [10:36:14] (03CR) 10Kormat: [C: 03+1] wmnet: Update s8-master [dns] - 10https://gerrit.wikimedia.org/r/737832 (https://phabricator.wikimedia.org/T294321) (owner: 10Marostegui) [10:36:54] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-test1008 is CRITICAL: 10 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=test-eqiad&var-kafka_broker=kafka-test1008 [10:37:24] (03CR) 10Kormat: [C: 03+1] wmnet: Replace m5-master with dbproxy1017 [dns] - 10https://gerrit.wikimedia.org/r/737837 (https://phabricator.wikimedia.org/T288093) (owner: 10Marostegui) [10:39:33] that's a hattrick [10:39:51] :) [10:41:15] (03PS2) 10Thiemo Kreuz 
(WMDE): Use more compact PHP7 syntax where possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 [10:42:05] (03CR) 10jerkins-bot: [V: 04-1] Use more compact PHP7 syntax where possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 (owner: 10Thiemo Kreuz (WMDE)) [10:43:29] checking kafka test [10:51:01] (03PS1) 10Elukey: profile::base::certificates: renamed jks/p12 truststores [puppet] - 10https://gerrit.wikimedia.org/r/737882 [10:53:25] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-test1008 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=test-eqiad&var-kafka_broker=kafka-test1008 [10:53:53] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] p:ceph::client::rbd_cloudcontrol: remove keyring generation [puppet] - 10https://gerrit.wikimedia.org/r/737869 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [10:53:57] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] ceph::client::rbd_glance: remove keyring generation [puppet] - 10https://gerrit.wikimedia.org/r/737860 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [10:54:18] (03CR) 10Jbond: [C: 03+1] profile::base::certificates: renamed jks/p12 truststores [puppet] - 10https://gerrit.wikimedia.org/r/737882 (owner: 10Elukey) [10:54:54] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] aptrepo: add component for rackspace openstack debs [puppet] - 10https://gerrit.wikimedia.org/r/737856 (https://phabricator.wikimedia.org/T295234) (owner: 10Majavah) [10:55:33] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32327/console" [puppet] - 10https://gerrit.wikimedia.org/r/737882 (owner: 10Elukey) [10:56:19] 10SRE, 10Infrastructure-Foundations, 10netops: Rebuild Routinator (rpki) VMs with larger disk - 
https://phabricator.wikimedia.org/T292503 (10cmooney) A security update is now available which means we need to upgrade again: https://www.nlnetlabs.nl/news/2021/Nov/09/routinator-0.10.2-released/ I'll dig into... [10:56:21] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::base::certificates: renamed jks/p12 truststores [puppet] - 10https://gerrit.wikimedia.org/r/737882 (owner: 10Elukey) [10:56:37] (03PS1) 10JMeybohm: Add an update-ca-certificates hook maintaining wmf-ca-certificates.crt [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/737884 [10:56:39] (03PS1) 10JMeybohm: Install update-ca-certificates hook maintaining wmf-ca-certificates.crt [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/737885 [11:02:46] (03PS1) 10Jgiannelos: tegola-vector-tiles: Revert cronjob schedule for codfw to default [deployment-charts] - 10https://gerrit.wikimedia.org/r/737886 [11:03:27] (03PS2) 10JMeybohm: Add an update-ca-certificates hook maintaining wmf-ca-certificates.crt [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/737884 [11:03:29] (03PS2) 10JMeybohm: Install update-ca-certificates hook maintaining wmf-ca-certificates.crt [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/737885 [11:06:31] (03PS1) 10Jgiannelos: tile-pregeneration: Revert batching for performance reasons [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/737888 [11:07:49] (03PS1) 10Btullis: Add port 4400/tcp to the mysql-replica rule in the analytics policy [homer/public] - 10https://gerrit.wikimedia.org/r/737889 (https://phabricator.wikimedia.org/T295312) [11:10:11] Hi we would like to make some changes in the kafka partitions setup for maps topics. Who would be the right team/person to reach out to? 
[11:18:52] (03Abandoned) 10Ladsgroup: Increase logging level of DBPerformance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737775 (owner: 10Ladsgroup) [11:20:52] (03PS5) 10JMeybohm: Add basic ingress support to chart common_templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/732374 (https://phabricator.wikimedia.org/T290966) [11:21:13] (03CR) 10Jelto: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/737169 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [11:27:12] (03CR) 10JMeybohm: [C: 03+2] Add cfssl-issuer and cfssl-issuer-crds chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/737169 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [11:32:15] (03Merged) 10jenkins-bot: Add cfssl-issuer and cfssl-issuer-crds chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/737169 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [11:32:17] (03PS1) 10Jbond: check_user: switch to using email address [puppet] - 10https://gerrit.wikimedia.org/r/737893 (https://phabricator.wikimedia.org/T259746) [11:34:21] (03CR) 10Jbond: [C: 03+2] check_user: switch to using email address [puppet] - 10https://gerrit.wikimedia.org/r/737893 (https://phabricator.wikimedia.org/T259746) (owner: 10Jbond) [11:40:41] (03PS4) 10Vgutierrez: varnish: Mimick XFF behaviour with UDS + PROXY protocol [puppet] - 10https://gerrit.wikimedia.org/r/737755 (https://phabricator.wikimedia.org/T290005) [11:47:30] (03PS2) 10Btullis: Add port 4400/tcp to the mysql-replica rule in the analytics policy [homer/public] - 10https://gerrit.wikimedia.org/r/737889 (https://phabricator.wikimedia.org/T295312) [11:52:02] PROBLEM - Check systemd state on kafka-test1006 is CRITICAL: CRITICAL - degraded: The following units failed: kafka-mirror-jumbo-eqiad_to_test-eqiad@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:56:40] (03PS1) 10Jgiannelos: maps: Add cronjob to send tile invalidation events 
[puppet] - 10https://gerrit.wikimedia.org/r/737898 (https://phabricator.wikimedia.org/T270175) [11:57:11] (03CR) 10jerkins-bot: [V: 04-1] maps: Add cronjob to send tile invalidation events [puppet] - 10https://gerrit.wikimedia.org/r/737898 (https://phabricator.wikimedia.org/T270175) (owner: 10Jgiannelos) [11:58:20] (03PS2) 10Jgiannelos: maps: Add cronjob to send tile invalidation events [puppet] - 10https://gerrit.wikimedia.org/r/737898 (https://phabricator.wikimedia.org/T270175) [11:58:56] (03CR) 10jerkins-bot: [V: 04-1] maps: Add cronjob to send tile invalidation events [puppet] - 10https://gerrit.wikimedia.org/r/737898 (https://phabricator.wikimedia.org/T270175) (owner: 10Jgiannelos) [12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211110T1200). [12:00:05] Juan_90264 and Lucas_WMDE: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:16] o/ [12:00:19] I can deploy today [12:00:20] Hello, I'm present [12:00:24] hi! [12:00:41] I left a comment on one of your changes, did you see it? [12:03:16] What should I have seen? 
[12:03:39] (03CR) 10Lucas Werkmeister (WMDE): Update $wgNamespacesToBeSearchedDefault for Wikimania 2022 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737695 (https://phabricator.wikimedia.org/T295267) (owner: 10Juan90264) [12:03:54] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:04:05] not sure how to link to the comment specifically, but it’s at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/737695 [12:04:08] (03PS1) 10Elukey: Restore previous TLS settings for Kafka test [puppet] - 10https://gerrit.wikimedia.org/r/737899 [12:04:21] Oh yes, I saw the comment and I agree with you. Ignore this Gerrit who commented [12:04:41] (03CR) 10Lucas Werkmeister (WMDE): Update $wgNamespacesToBeSearchedDefault for Wikimania 2022 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737695 (https://phabricator.wikimedia.org/T295267) (owner: 10Juan90264) [12:05:44] (03CR) 10Juan90264: Update $wgNamespacesToBeSearchedDefault for Wikimania 2022 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737695 (https://phabricator.wikimedia.org/T295267) (owner: 10Juan90264) [12:06:00] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:06:07] ok, do you want to upload a new patch set? 
[12:06:12] we can deploy the first change in the meantime [12:06:19] !log wikiadmin@10.64.48.109(centralauth)> delete from globalnames where gn_name='AAnctil (WMF)'; # to let OIT create that account globally, SULification of foundationwiki, T205347 [12:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:22] T205347: Enable SUL accounts on Governance wiki - https://phabricator.wikimedia.org/T205347 [12:06:24] (03CR) 10Ayounsi: Add port 4400/tcp to the mysql-replica rule in the analytics policy (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/737889 (https://phabricator.wikimedia.org/T295312) (owner: 10Btullis) [12:06:41] !log wikiadmin@10.64.48.109(centralauth)> select * from localnames where ln_name='AAnctil (WMF)'; # to let OIT create that account globally, SULification of foundationwiki, T205347 [12:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:50] (03CR) 10Elukey: [C: 03+2] Restore previous TLS settings for Kafka test [puppet] - 10https://gerrit.wikimedia.org/r/737899 (owner: 10Elukey) [12:06:59] eh, wrong query [12:07:19] !log wikiadmin@10.64.48.109(centralauth)> delete from localnames where ln_wiki='foundationwiki' and ln_name='AAnctil (WMF)'; # to let OIT create that account globally, SULification of foundationwiki, T205347 [12:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:35] !log wikiadmin@10.64.48.109(centralauth)> delete from localnames where ln_name='DJemielniak (WMF)' and ln_wiki='foundationwiki'; # to let OIT create that account globally, SULification of foundationwiki, T205347 [12:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:45] !log wikiadmin@10.64.48.109(centralauth)> delete from globalnames where gn_name='DJemielniak (WMF)'; # to let OIT create that account globally, SULification of foundationwiki, T205347 [12:07:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:25] Lucas_WMDE: Can deploy the first in the meantime [12:08:33] yeah, sure [12:09:11] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "Namespace 134 is free: https://wikimania.wikimedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737082 (https://phabricator.wikimedia.org/T295267) (owner: 10Bodhisattwa) [12:10:13] (03PS3) 10Btullis: Add port 4400/tcp to the mysql-replica rule in the analytics policy [homer/public] - 10https://gerrit.wikimedia.org/r/737889 (https://phabricator.wikimedia.org/T295312) [12:12:09] (03PS3) 10Lucas Werkmeister (WMDE): create 2022 namespace for wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737082 (https://phabricator.wikimedia.org/T295267) (owner: 10Bodhisattwa) [12:12:15] (03PS1) 10Vgutierrez: cache:haproxy: Enable hitless reloads [puppet] - 10https://gerrit.wikimedia.org/r/737902 (https://phabricator.wikimedia.org/T290005) [12:12:22] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "no auto-rebase…?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737082 (https://phabricator.wikimedia.org/T295267) (owner: 10Bodhisattwa) [12:13:26] (03Merged) 10jenkins-bot: create 2022 namespace for wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737082 (https://phabricator.wikimedia.org/T295267) (owner: 10Bodhisattwa) [12:14:45] (03CR) 10Ayounsi: [C: 03+2] Update bblack ssh key [homer/public] - 10https://gerrit.wikimedia.org/r/737765 (owner: 10BBlack) [12:15:07] (03CR) 10Jgiannelos: "* I am mostly adapting code from other cron definitions we have in the codebase" [puppet] - 10https://gerrit.wikimedia.org/r/737898 (https://phabricator.wikimedia.org/T270175) (owner: 10Jgiannelos) [12:15:19] (03Merged) 10jenkins-bot: Update bblack ssh key [homer/public] - 10https://gerrit.wikimedia.org/r/737765 (owner: 10BBlack) [12:15:38] Juan_90264: the change is on mwdebug1001, can you test it? 
[12:16:25] Yes, I can [12:16:51] (03CR) 10Majavah: [C: 04-1] maps: Add cronjob to send tile invalidation events (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/737898 (https://phabricator.wikimedia.org/T270175) (owner: 10Jgiannelos) [12:18:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:24] Lucas_WMDE: I tested and approved [12:18:33] ok, to me it looks good as well [12:18:42] (03CR) 10Btullis: Add port 4400/tcp to the mysql-replica rule in the analytics policy (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/737889 (https://phabricator.wikimedia.org/T295312) (owner: 10Btullis) [12:18:48] (03CR) 10Btullis: [C: 03+2] Add port 4400/tcp to the mysql-replica rule in the analytics policy [homer/public] - 10https://gerrit.wikimedia.org/r/737889 (https://phabricator.wikimedia.org/T295312) (owner: 10Btullis) [12:19:06] syncing [12:19:57] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:737082|create 2022 namespace for wikimaniawiki (T295267)]] (duration: 00m 56s) [12:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:01] T295267: create 2022 namespace for wikimaniawiki - https://phabricator.wikimedia.org/T295267 [12:20:44] alright, first change done [12:20:52] !log Connect `Jbuatti (WMF)@foundationwiki` to SUL [12:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:58] can you upload a new version of the second change or do you want to wait with that until later? 
[13:01:42] (03CR) 10Urbanecm: [C: 03+2] Revert "[beta] Enable CentralAuth on foundationwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737908 (https://phabricator.wikimedia.org/T205347) (owner: 10Urbanecm) [13:01:46] (03PS2) 10Urbanecm: Revert "[beta] Enable CentralAuth on foundationwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737908 (https://phabricator.wikimedia.org/T205347) [13:01:49] (03CR) 10Urbanecm: Revert "[beta] Enable CentralAuth on foundationwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737908 (https://phabricator.wikimedia.org/T205347) (owner: 10Urbanecm) [13:01:53] (03CR) 10Urbanecm: [C: 03+2] Revert "[beta] Enable CentralAuth on foundationwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737908 (https://phabricator.wikimedia.org/T205347) (owner: 10Urbanecm) [13:02:35] (03Merged) 10jenkins-bot: Revert "[beta] Enable CentralAuth on foundationwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737908 (https://phabricator.wikimedia.org/T205347) (owner: 10Urbanecm) [13:03:07] * urbanecm done [13:03:34] !log UTC morning backport+config window done [13:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:53] (bit strange that UTC morning now starts at noon but eh ^^) [13:05:00] Thanks Lucas_WMDE for deploying :) [13:05:09] np, thanks for the patches :) [13:06:17] Does anyone know how I get that sticker that Jouncebot quotes during deployments? [13:08:33] "Note: If you break AND fix the wikis, you will be rewarded with a sticker." [13:10:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:09] not sure if there’s a formal process for that ;) [13:13:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[13:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:01] Urbanecm: Today around 00:00 UTC I was thinking of making a time change to the "UTC late backport window", the idea is to change this implementation to 22:00 UTC in order to make it easier to have a deployer [13:16:23] Juan_90264: it's actually under informal discussion already :)) [13:17:20] Okay [13:18:10] (03PS3) 10Thiemo Kreuz (WMDE): Use more compact PHP7 syntax where possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 [13:18:42] Urbanecm: Even though it is already under discussion, what do you think of the idea? [13:19:25] without moving anything else, it'd mean you'd have two deployment windows in three hours only [13:20:04] so it'd need being a bit more careful in designing the new schedule, a lot of the windows at https://wikitech.wikimedia.org/wiki/Deployments depend on each other [13:20:40] Urbanecm: But at least there would be some deployer available at this time, while the current time [13:21:46] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:22:16] (03PS1) 10MVernon: profile::thanos::swift: add account for research datasets poc [puppet] - 10https://gerrit.wikimedia.org/r/737913 (https://phabricator.wikimedia.org/T294380) [13:23:48] (03PS6) 10JMeybohm: Add basic ingress support to chart common_templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/732374 (https://phabricator.wikimedia.org/T290966) [13:23:52] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:29:49] (03PS1) 10Jbond: hiera: create script endpoint for exporting hiera data [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914 [13:31:15] (03PS1) 10Elukey: Revert "Restore previous TLS settings for Kafka test" [puppet] - 10https://gerrit.wikimedia.org/r/737703 [13:33:14] (03PS3) 10MSantos: kartographer: enable tegola in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722605 (https://phabricator.wikimedia.org/T291178) [13:33:39] (03CR) 10Elukey: [C: 03+2] Revert "Restore previous TLS settings for Kafka test" [puppet] - 10https://gerrit.wikimedia.org/r/737703 (owner: 10Elukey) [13:33:42] (03PS1) 10MVernon: profile::thanos::swift: fake creds for research_poc [labs/private] - 10https://gerrit.wikimedia.org/r/737915 (https://phabricator.wikimedia.org/T294380) [13:34:49] (03PS2) 10Jbond: hiera: create script endpoint for exporting hiera data [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914 [13:36:26] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-mirror-maker restart MirrorMaker for Kafka A:kafka-mirror-maker-test-eqiad cluster: Roll restart of jvm daemons. - elukey@cumin1001 [13:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:03] (03PS10) 10Arturo Borrero Gonzalez: wmcs: add openstack network tests cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/737667 (https://phabricator.wikimedia.org/T294955) [13:38:12] 10SRE-swift-storage, 10Patch-For-Review: Storage request for datasets published by research team - https://phabricator.wikimedia.org/T294380 (10MatthewVernon) a:03MatthewVernon Hi, >>! In T294380#7490495, @fkaelin wrote: > Please let me know what the next steps are. I need to make some credentials (which I... 
[13:38:41] (03CR) 10Arturo Borrero Gonzalez: wmcs: add openstack network tests cookbook (0310 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/737667 (https://phabricator.wikimedia.org/T294955) (owner: 10Arturo Borrero Gonzalez) [13:40:40] (03CR) 10jerkins-bot: [V: 04-1] wmcs: add openstack network tests cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/737667 (https://phabricator.wikimedia.org/T294955) (owner: 10Arturo Borrero Gonzalez) [13:45:32] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:46:35] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) restart MirrorMaker for Kafka A:kafka-mirror-maker-test-eqiad cluster: Roll restart of jvm daemons. - elukey@cumin1001 [13:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:02] (03PS11) 10Arturo Borrero Gonzalez: wmcs: add openstack network tests cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/737667 (https://phabricator.wikimedia.org/T294955) [13:47:37] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [13:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:01] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. - elukey@cumin1001 [13:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:44] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:50:44] (CR) jerkins-bot: [V: -1] wmcs: add openstack network tests cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/737667 (https://phabricator.wikimedia.org/T294955) (owner: Arturo Borrero Gonzalez) [13:54:23] (PS12) Arturo Borrero Gonzalez: wmcs: add openstack network tests cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/737667 (https://phabricator.wikimedia.org/T294955) [13:57:08] (CR) Arturo Borrero Gonzalez: [C: +2] wmcs: add openstack network tests cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/737667 (https://phabricator.wikimedia.org/T294955) (owner: Arturo Borrero Gonzalez) [13:57:36] (CR) Btullis: [C: +2] Sync db1108's my.cnf settings with analytics_meta [puppet] - https://gerrit.wikimedia.org/r/736780 (https://phabricator.wikimedia.org/T279440) (owner: Ottomata) [13:59:08] (CR) MSantos: [C: +1] maps: Add cronjob to send tile invalidation events [puppet] - https://gerrit.wikimedia.org/r/737898 (https://phabricator.wikimedia.org/T270175) (owner: Jgiannelos) [14:00:43] (Merged) jenkins-bot: wmcs: add openstack network tests cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/737667 (https://phabricator.wikimedia.org/T294955) (owner: Arturo Borrero Gonzalez) [14:01:29] (CR) MSantos: tile-pregeneration: Revert batching for performance reasons (1 comment) [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/737888 (owner: Jgiannelos) [14:02:42] (CR) Marostegui: [C: +2] wmnet: Replace m5-master with dbproxy1017 [dns] - https://gerrit.wikimedia.org/r/737837 (https://phabricator.wikimedia.org/T288093) (owner: Marostegui) [14:02:55] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:34] PROBLEM -
Check systemd state on db1108 is CRITICAL: CRITICAL - degraded: The following units failed: mariadb@analytics_meta.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:38] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. - elukey@cumin1001 [14:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:15] (CR) Volans: [C: -1] "Possible unwanted behaviour inline" [puppet] - https://gerrit.wikimedia.org/r/736949 (https://phabricator.wikimedia.org/T293450) (owner: Giuseppe Lavagetto) [14:09:53] !log restarted mailman3/mailman3-web to pick up new DNS for m5-master [14:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:10] (CR) David Caro: [C: +2] ceph::client::rbd_glance: remove keyring generation [puppet] - https://gerrit.wikimedia.org/r/737860 (https://phabricator.wikimedia.org/T293752) (owner: David Caro) [14:12:43] (PS1) Elukey: Move kafka-test1006 to PKI-based broker certs [puppet] - https://gerrit.wikimedia.org/r/737920 (https://phabricator.wikimedia.org/T291905) [14:14:02] (CR) Elukey: [C: +2] Move kafka-test1006 to PKI-based broker certs [puppet] - https://gerrit.wikimedia.org/r/737920 (https://phabricator.wikimedia.org/T291905) (owner: Elukey) [14:15:02] (CR) David Caro: [V: +1 C: +2] "PCC SUCCESS (NOOP 2 DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32331/console" [puppet] - https://gerrit.wikimedia.org/r/737860 (https://phabricator.wikimedia.org/T293752) (owner: David Caro) [14:17:17] (PS3) Jbond: hiera: create script endpoint for exporting hiera data [software/netbox-extras] - https://gerrit.wikimedia.org/r/737914 (https://phabricator.wikimedia.org/T229397) [14:18:11] (CR) Jgiannelos: tile-pregeneration: Revert batching for performance
reasons (1 comment) [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/737888 (owner: Jgiannelos) [14:24:33] (CR) Ema: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32332/console" [puppet] - https://gerrit.wikimedia.org/r/737755 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez) [14:29:46] (CR) Ema: [V: +1] varnish: Mimick XFF behaviour with UDS + PROXY protocol (1 comment) [puppet] - https://gerrit.wikimedia.org/r/737755 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez) [14:31:48] (PS9) Ema: varnish: add varnish::logging::mtail [puppet] - https://gerrit.wikimedia.org/r/737424 (https://phabricator.wikimedia.org/T293879) [14:34:29] ACKNOWLEDGEMENT - Check systemd state on db1108 is CRITICAL: CRITICAL - degraded: The following units failed: mariadb@analytics_meta.service Marostegui T295312 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:36:27] SRE, Infrastructure-Foundations, Znuny, serviceops: upgrade/replace VRTS (formerly ORTS) buster to bullseye - https://phabricator.wikimedia.org/T295416 (akosiaris) +1 on the general concept and actions. Some more information on the `some magic to import data if needed` part: Not really much is... [14:36:48] SRE-swift-storage, Patch-For-Review: Storage request for datasets published by research team - https://phabricator.wikimedia.org/T294380 (fkaelin) Hi, Thanks for getting this started, no worries about the delay for review. I presume I don't have the private puppet repository, I haven't been involved i... [14:38:17] SRE, Analytics-Radar, Data-Engineering, Event-Platform, Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (elukey) Finally kafka-test1006 is running with a PKI kafka intermediate cert, and the rest of the cluster works...
[14:39:04] (PS8) Ottomata: [WIP] declare airflow-dags/analytics scap source/target for airflow analytics instance [puppet] - https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) [14:39:45] (CR) jerkins-bot: [V: -1] [WIP] declare airflow-dags/analytics scap source/target for airflow analytics instance [puppet] - https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) (owner: Ottomata) [14:45:02] (CR) EllenR: Set up beta test environment for QuickSurvey (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: EllenR) [14:45:47] (PS4) EllenR: Set up beta test environment for QuickSurvey [mediawiki-config] - https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) [14:47:32] (PS4) Jbond: hiera: create script endpoint for exporting hiera data [software/netbox-extras] - https://gerrit.wikimedia.org/r/737914 [14:47:55] (PS1) Elukey: kafkatee:instance: change TLS CA bundle [puppet] - https://gerrit.wikimedia.org/r/737923 (https://phabricator.wikimedia.org/T291905) [14:50:26] (CR) Elukey: "elukey@cumin1001:~$ sudo cumin 'r:kafkatee::instance'" [puppet] - https://gerrit.wikimedia.org/r/737923 (https://phabricator.wikimedia.org/T291905) (owner: Elukey) [14:51:00] (PS1) David Caro: openstack::control: enable ceph::auth::load for codfw [puppet] - https://gerrit.wikimedia.org/r/737925 (https://phabricator.wikimedia.org/T293752) [14:51:35] (PS1) Ladsgroup: idp: Allow wmf or nda have access to orchestrator [puppet] - https://gerrit.wikimedia.org/r/737926 (https://phabricator.wikimedia.org/T265990) [14:51:41] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/737923 (https://phabricator.wikimedia.org/T291905) (owner: Elukey) [14:51:46] (CR) jerkins-bot: [V: -1] openstack::control: enable ceph::auth::load for codfw [puppet] -
https://gerrit.wikimedia.org/r/737925 (https://phabricator.wikimedia.org/T293752) (owner: David Caro) [14:59:05] (PS1) David Caro: ceph::auth Add glance keys [labs/private] - https://gerrit.wikimedia.org/r/737928 [15:01:04] (CR) David Caro: [C: +2] ceph::auth Add glance keys [labs/private] - https://gerrit.wikimedia.org/r/737928 (owner: David Caro) [15:01:21] (CR) David Caro: [V: +2 C: +2] ceph::auth Add glance keys [labs/private] - https://gerrit.wikimedia.org/r/737928 (owner: David Caro) [15:03:25] (PS15) Giuseppe Lavagetto: mediawiki::php: support multiple php version in monitoring too [puppet] - https://gerrit.wikimedia.org/r/736949 (https://phabricator.wikimedia.org/T293450) [15:03:27] (PS11) Giuseppe Lavagetto: mediawiki: add support for multiple versions in the web configuration [puppet] - https://gerrit.wikimedia.org/r/737330 (https://phabricator.wikimedia.org/T293450) [15:03:29] (PS1) Giuseppe Lavagetto: mediawiki::php: report prometheus metrics for all php versions [puppet] - https://gerrit.wikimedia.org/r/737929 [15:04:08] (PS1) Btullis: Fix the my.cnf file for the analytics_multiinstance backup DB host [puppet] - https://gerrit.wikimedia.org/r/737930 (https://phabricator.wikimedia.org/T295312) [15:05:15] (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32337/console" [puppet] - https://gerrit.wikimedia.org/r/737930 (https://phabricator.wikimedia.org/T295312) (owner: Btullis) [15:05:45] (PS2) David Caro: openstack::control: enable ceph::auth::load for codfw [puppet] - https://gerrit.wikimedia.org/r/737925 (https://phabricator.wikimedia.org/T293752) [15:06:20] (PS1) Elukey: profile::cache::kafka::webrequest: move atskafka to new CA bundle [puppet] - https://gerrit.wikimedia.org/r/737931 (https://phabricator.wikimedia.org/T291905) [15:08:14] (CR) Btullis: [V: +1] "PCC looks good. Self-approving."
[puppet] - https://gerrit.wikimedia.org/r/737930 (https://phabricator.wikimedia.org/T295312) (owner: Btullis) [15:08:17] (CR) Btullis: [V: +1 C: +2] Fix the my.cnf file for the analytics_multiinstance backup DB host [puppet] - https://gerrit.wikimedia.org/r/737930 (https://phabricator.wikimedia.org/T295312) (owner: Btullis) [15:12:54] (PS3) David Caro: openstack::control: enable ceph::auth::load for codfw [puppet] - https://gerrit.wikimedia.org/r/737925 (https://phabricator.wikimedia.org/T293752) [15:14:26] (CR) Ema: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32339/console" [puppet] - https://gerrit.wikimedia.org/r/737931 (https://phabricator.wikimedia.org/T291905) (owner: Elukey) [15:16:48] (PS5) Jhernandez: Set up beta test environment for QuickSurvey [mediawiki-config] - https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: EllenR) [15:17:16] (CR) Legoktm: mediawiki::php: report prometheus metrics for all php versions (3 comments) [puppet] - https://gerrit.wikimedia.org/r/737929 (owner: Giuseppe Lavagetto) [15:17:22] (CR) EllenR: Set up beta test environment for QuickSurvey (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: EllenR) [15:17:59] (PS1) Btullis: Fix a second small syntax error in the my.cnf file for db1108 [puppet] - https://gerrit.wikimedia.org/r/737932 (https://phabricator.wikimedia.org/T295312) [15:18:41] (CR) Ema: [V: +1 C: +1] profile::cache::kafka::webrequest: move atskafka to new CA bundle [puppet] - https://gerrit.wikimedia.org/r/737931 (https://phabricator.wikimedia.org/T291905) (owner: Elukey) [15:19:28] (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32341/console" [puppet] -
https://gerrit.wikimedia.org/r/737932 (https://phabricator.wikimedia.org/T295312) (owner: Btullis) [15:21:16] (CR) Btullis: [V: +1 C: +2] Fix a second small syntax error in the my.cnf file for db1108 [puppet] - https://gerrit.wikimedia.org/r/737932 (https://phabricator.wikimedia.org/T295312) (owner: Btullis) [15:22:54] (PS4) David Caro: openstack::control: remove other old keyrings and enable for codfw [puppet] - https://gerrit.wikimedia.org/r/737925 (https://phabricator.wikimedia.org/T293752) [15:23:34] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:24:38] (CR) David Caro: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32342/console" [puppet] - https://gerrit.wikimedia.org/r/737925 (https://phabricator.wikimedia.org/T293752) (owner: David Caro) [15:24:58] RECOVERY - Check systemd state on db1108 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:25:44] (CR) Kormat: [C: +1] idp: Allow wmf or nda have access to orchestrator [puppet] - https://gerrit.wikimedia.org/r/737926 (https://phabricator.wikimedia.org/T265990) (owner: Ladsgroup) [15:27:05] (PS2) Ottomata: Add gitlab support for scap_source [puppet] - https://gerrit.wikimedia.org/r/737764 (https://phabricator.wikimedia.org/T295380) [15:27:07] (PS9) Ottomata: [WIP] declare airflow-dags/analytics scap source/target for airflow analytics instance [puppet] - https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) [15:27:56] (CR) David Caro: [V: +1] "PCC SUCCESS (NOOP 7 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32343/console" [puppet] - https://gerrit.wikimedia.org/r/737925 (https://phabricator.wikimedia.org/T293752)
(owner: David Caro) [15:28:08] (CR) jerkins-bot: [V: -1] [WIP] declare airflow-dags/analytics scap source/target for airflow analytics instance [puppet] - https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) (owner: Ottomata) [15:29:09] (CR) David Caro: [V: +1 C: +2] "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1001/32343/" [puppet] - https://gerrit.wikimedia.org/r/737925 (https://phabricator.wikimedia.org/T293752) (owner: David Caro) [15:30:14] (CR) Ottomata: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32344/console" [puppet] - https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) (owner: Ottomata) [15:32:18] (PS10) Ottomata: Declare airflow-dags scap for analytics instance [puppet] - https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) [15:32:54] Searching for files on Commons is currently impossible; I believe this is quite critical, given that the whole point of Commons is being a file repository [15:33:14] (CR) jerkins-bot: [V: -1] Declare airflow-dags scap for analytics instance [puppet] - https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) (owner: Ottomata) [15:33:15] T295480 [15:33:16] T295480: Searching for files on Commons returns error - https://phabricator.wikimedia.org/T295480 [15:34:19] (CR) Eigyan: "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: EllenR) [15:35:42] (CR) Ottomata: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32345/console" [puppet] - https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) (owner: Ottomata) [15:36:07] Dylsss: hey.
some people are looking into it [15:36:14] thanks for filing the task [15:41:56] (PS1) David Caro: ceph::control: enable auth deploy on eqiad and remove unused vars [puppet] - https://gerrit.wikimedia.org/r/737936 (https://phabricator.wikimedia.org/T293752) [15:43:57] (CR) jerkins-bot: [V: -1] ceph::control: enable auth deploy on eqiad and remove unused vars [puppet] - https://gerrit.wikimedia.org/r/737936 (https://phabricator.wikimedia.org/T293752) (owner: David Caro) [15:44:11] (CR) David Caro: [V: +1 C: +2] "PCC SUCCESS (NOOP 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32346/console" [puppet] - https://gerrit.wikimedia.org/r/737925 (https://phabricator.wikimedia.org/T293752) (owner: David Caro) [15:46:41] (PS1) Ebernhardson: Move CirrusSearch read traffic to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/737938 (https://phabricator.wikimedia.org/T295480) [15:47:41] going to be pushing a mediawiki-config patch in a few minutes, unless someone else is deploying?
[15:47:45] (CR) DCausse: [C: +1] Move CirrusSearch read traffic to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/737938 (https://phabricator.wikimedia.org/T295480) (owner: Ebernhardson) [15:47:51] (PS2) Jgiannelos: tile-pregeneration: Re-introduce batching for performance reasons [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/737888 [15:48:24] (PS2) Ebernhardson: Move CirrusSearch read traffic to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/737938 (https://phabricator.wikimedia.org/T295480) [15:48:32] (CR) Ebernhardson: [C: +2] Move CirrusSearch read traffic to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/737938 (https://phabricator.wikimedia.org/T295480) (owner: Ebernhardson) [15:50:40] (Merged) jenkins-bot: Move CirrusSearch read traffic to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/737938 (https://phabricator.wikimedia.org/T295480) (owner: Ebernhardson) [15:52:16] (PS1) JMeybohm: admin_ng: Add helmfile for cert-manager and cfssl-issuer [deployment-charts] - https://gerrit.wikimedia.org/r/737939 (https://phabricator.wikimedia.org/T294560) [15:52:36] (CR) Giuseppe Lavagetto: mediawiki::php: report prometheus metrics for all php versions (2 comments) [puppet] - https://gerrit.wikimedia.org/r/737929 (owner: Giuseppe Lavagetto) [15:52:47] !log ebernhardson@deploy1002 Synchronized wmf-config/InitialiseSettings.php: T295480: Move all cirrussearch traffic to codfw (duration: 00m 56s) [15:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:51] T295480: Searching for files on Commons returns error - https://phabricator.wikimedia.org/T295480 [15:53:04] (PS11) Ottomata: Declare airflow-dags scap for analytics instance [puppet] - https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) [15:53:26] (PS2) Giuseppe Lavagetto: mediawiki::php: report prometheus metrics for all php
versions [puppet] - https://gerrit.wikimedia.org/r/737929 [15:53:38] >reporting indices as both existing and not existing [15:53:39] Schrödinger's search index [15:54:06] (PS1) Jgiannelos: tile-pregeneration: Force overwriting tiles on cache operations [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/737941 [15:54:27] (CR) jerkins-bot: [V: -1] Declare airflow-dags scap for analytics instance [puppet] - https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) (owner: Ottomata) [15:54:52] (CR) JMeybohm: admin_ng: Add helmfile for cert-manager and cfssl-issuer (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/737939 (https://phabricator.wikimedia.org/T294560) (owner: JMeybohm) [15:55:30] (PS6) Vgutierrez: varnish: Mimick XFF behaviour with UDS + PROXY protocol [puppet] - https://gerrit.wikimedia.org/r/737755 (https://phabricator.wikimedia.org/T290005) [15:55:44] (CR) Vgutierrez: varnish: Mimick XFF behaviour with UDS + PROXY protocol (1 comment) [puppet] - https://gerrit.wikimedia.org/r/737755 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez) [15:55:57] (CR) Ottomata: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32348/console" [puppet] - https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) (owner: Ottomata) [15:56:11] (CR) MSantos: [C: +2] tile-pregeneration: Re-introduce batching for performance reasons [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/737888 (owner: Jgiannelos) [15:57:58] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32349/console" [puppet] - https://gerrit.wikimedia.org/r/737755 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez) [15:58:32] (Merged) jenkins-bot: tile-pregeneration: Re-introduce batching for
performance reasons [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/737888 (owner: Jgiannelos) [16:00:00] I am getting a "database is in read-only" message when trying to edit on commonswiki, by the way [16:00:21] (PS2) David Caro: ceph::control: enable auth deploy on eqiad and remove unused vars [puppet] - https://gerrit.wikimedia.org/r/737936 (https://phabricator.wikimedia.org/T293752) [16:01:27] Dylsss: works for me [16:04:49] (CR) Vgutierrez: [V: +1] "vgutierrez@carrot:~/wikimedia.org/operations/puppet/modules/varnish/files/tests$ ./docker_run.sh cp4027.ulsfo.wmnet 737755" [puppet] - https://gerrit.wikimedia.org/r/737755 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez) [16:05:40] err, nevermind, I had X-Wikimedia-Debug set to mwdebug2001.codfw.wmnet [16:10:54] (PS2) MSantos: tile-pregeneration: Force overwriting tiles on cache operations [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/737941 (owner: Jgiannelos) [16:11:56] (CR) Ema: [C: +1] varnish: Mimick XFF behaviour with UDS + PROXY protocol [puppet] - https://gerrit.wikimedia.org/r/737755 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez) [16:12:44] !log bblack@cumin1001 START - Cookbook sre.hosts.reimage for host dns6001.wikimedia.org with OS buster [16:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:54] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host dns6001.wikimedia.org with OS buster [16:14:43] !log bblack@cumin1001 START - Cookbook sre.hosts.reimage for host dns6002.wikimedia.org with OS buster [16:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:54] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new
drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host dns6002.wikimedia.org with OS buster [16:16:59] (PS16) Giuseppe Lavagetto: mediawiki::php: support multiple php version in monitoring too [puppet] - https://gerrit.wikimedia.org/r/736949 (https://phabricator.wikimedia.org/T293450) [16:17:01] (PS12) Giuseppe Lavagetto: mediawiki: add support for multiple versions in the web configuration [puppet] - https://gerrit.wikimedia.org/r/737330 (https://phabricator.wikimedia.org/T293450) [16:17:03] (PS3) Giuseppe Lavagetto: mediawiki::php: report prometheus metrics for all php versions [puppet] - https://gerrit.wikimedia.org/r/737929 [16:19:50] (CR) Hashar: role: system::role for all mediawiki roles (1 comment) [puppet] - https://gerrit.wikimedia.org/r/730004 (owner: Hashar) [16:20:14] (CR) MSantos: [C: +2] tile-pregeneration: Force overwriting tiles on cache operations [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/737941 (owner: Jgiannelos) [16:21:15] (CR) MSantos: [C: +1] tegola-vector-tiles: Revert cronjob schedule for codfw to default [deployment-charts] - https://gerrit.wikimedia.org/r/737886 (owner: Jgiannelos) [16:21:46] (Merged) jenkins-bot: tile-pregeneration: Force overwriting tiles on cache operations [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/737941 (owner: Jgiannelos) [16:22:10] (CR) Elukey: [C: +2] kafkatee:instance: change TLS CA bundle [puppet] - https://gerrit.wikimedia.org/r/737923 (https://phabricator.wikimedia.org/T291905) (owner: Elukey) [16:25:59] (PS13) Giuseppe Lavagetto: mediawiki: add support for multiple versions in the web configuration [puppet] - https://gerrit.wikimedia.org/r/737330 (https://phabricator.wikimedia.org/T293450) [16:26:03] (PS4) Giuseppe Lavagetto: mediawiki::php: report prometheus metrics for all php
versions [puppet] - https://gerrit.wikimedia.org/r/737929 [16:26:09] One more patch going out to mediawiki-config in a minute [16:26:13] (PS1) Ebernhardson: Actually move CirrusSearch read traffic to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/737945 (https://phabricator.wikimedia.org/T295480) [16:26:32] (CR) DCausse: [C: +1] Actually move CirrusSearch read traffic to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/737945 (https://phabricator.wikimedia.org/T295480) (owner: Ebernhardson) [16:26:39] (PS2) Ebernhardson: Actually move CirrusSearch read traffic to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/737945 (https://phabricator.wikimedia.org/T295480) [16:26:43] !log move kafkatee instances (analytics-test,centralog) to the new CA bundle - T291905 [16:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:47] T291905: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 [16:26:56] (CR) Ebernhardson: [C: +2] Actually move CirrusSearch read traffic to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/737945 (https://phabricator.wikimedia.org/T295480) (owner: Ebernhardson) [16:27:06] (CR) Elukey: [C: +2] profile::cache::kafka::webrequest: move atskafka to new CA bundle [puppet] - https://gerrit.wikimedia.org/r/737931 (https://phabricator.wikimedia.org/T291905) (owner: Elukey) [16:28:36] (Merged) jenkins-bot: Actually move CirrusSearch read traffic to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/737945 (https://phabricator.wikimedia.org/T295480) (owner: Ebernhardson) [16:28:40] !log move atskafka to the new CA bundle - T291905 [16:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[16:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:30] (PS1) Hashar: gitlab: turn on Content-Security-Policy [puppet] - https://gerrit.wikimedia.org/r/737968 (https://phabricator.wikimedia.org/T285363) [16:32:37] !log ebernhardson@deploy1002 Synchronized wmf-config/InitialiseSettings.php: T295480: Move all cirrussearch traffic to codfw (duration: 00m 55s) [16:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:40] T295480: Searching for files on Commons returns error - https://phabricator.wikimedia.org/T295480 [16:33:01] (CR) Hashar: "Reports can be checked via https://logstash.wikimedia.org/app/dashboards#/view/AW0h61hZZKA7RpiroFmS?_a=(query:(term:(source.keyword:gitlab" [puppet] - https://gerrit.wikimedia.org/r/737968 (https://phabricator.wikimedia.org/T285363) (owner: Hashar) [16:33:26] (CR) Mepps: [C: -1] Set up beta test environment for QuickSurvey (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: EllenR) [16:34:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[16:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:15] (CR) EllenR: Set up beta test environment for QuickSurvey (4 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: EllenR) [16:43:48] (CR) Ayounsi: [C: +1] Revert the temporary change that was made for transfer.py [homer/public] - https://gerrit.wikimedia.org/r/737906 (https://phabricator.wikimedia.org/T295312) (owner: Btullis) [16:44:15] (CR) EllenR: Set up beta test environment for QuickSurvey (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: EllenR) [16:44:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:03] SRE, Analytics, Traffic-Icebox, Patch-For-Review: Compare logs produced by atskfafka with those produced by varnishkafka - https://phabricator.wikimedia.org/T254317 (elukey) While checking atskafka logs I found something interesting: ` elukey@cp3050:~$ sudo journalctl -u atskafka-webrequest.serv... [16:47:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[16:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:50] !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dns6002.wikimedia.org with OS buster [16:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:09] (CR) Jgiannelos: [C: +2] tegola-vector-tiles: Revert cronjob schedule for codfw to default [deployment-charts] - https://gerrit.wikimedia.org/r/737886 (owner: Jgiannelos) [16:50:13] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host dns6002.wikimedia.org with OS buster executed with errors: - dns600... [16:52:46] (PS1) Elukey: Move coal and navtiming to the new CA bundle [puppet] - https://gerrit.wikimedia.org/r/737970 (https://phabricator.wikimedia.org/T291905) [16:54:36] (Merged) jenkins-bot: tegola-vector-tiles: Revert cronjob schedule for codfw to default [deployment-charts] - https://gerrit.wikimedia.org/r/737886 (owner: Jgiannelos) [16:56:36] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:58:44] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:59:23] (PS2) Elukey: Move coal, navtiming and statsv to the new CA bundle [puppet] - https://gerrit.wikimedia.org/r/737970 (https://phabricator.wikimedia.org/T291905) [17:03:35] (Abandoned) Eigyan: wmf-config: Restore logging for mediamoderation script to better understand high error rate occurring when running script [mediawiki-config] - https://gerrit.wikimedia.org/r/708832 (https://phabricator.wikimedia.org/T287511) (owner: Eigyan) [17:04:32] (CR) Dave Pifke: [C: -1] "/etc/ssl/localcerts/wmf_trusted_root_CAs.pem does not exist in deployment-prep (but does seem present in prod):" [puppet] - https://gerrit.wikimedia.org/r/737970 (https://phabricator.wikimedia.org/T291905) (owner: Elukey) [17:06:11] (CR) Elukey: Move coal, navtiming and statsv to the new CA bundle (1 comment) [puppet] - https://gerrit.wikimedia.org/r/737970 (https://phabricator.wikimedia.org/T291905) (owner: Elukey) [17:07:09] SRE, serviceops, Continuous-Integration-Config, Release-Engineering-Team (CI & Testing services), Test-Coverage: Add pcov PHP extension to wikimedia apt so it can be used in Wikimedia CI - https://phabricator.wikimedia.org/T243847 (Daimona) Invalid→Open This is actually still relevant... [17:09:03] We had a CirrusSearch incident a couple of hours ago in which the Elasticsearch index that holds entries for all Commons files (`commonswiki_file_*`) was deleted and recreated (without its data) in the active production cluster (`eqiad`). [17:09:13] As a result, all Commons file searches failed (https://phabricator.wikimedia.org/T295478); furthermore, any cross-cluster searches on, say, Wikipedia failed as well (if the wiki in question chooses to surface Commons-related results in the sidebar).
For example English wiki wasn't impacted because that community disables Commons-related results, whereas a wiki like French wiki would have been impacted. [17:09:23] Here's logs of all the related failures (for those w/ access): https://logstash.wikimedia.org/goto/73a9d7e35f409c0d122888d42df94761 [17:09:35] There's no more user impact currently given that we've successfully switched over to our backup cluster (`codfw`). We'll need to restore the `commonswiki_file` index to eqiad before we can switch back to the normal cluster. We'll have an actual post-mortem incident doc up later but just wanted to summarize the basic situation right now [17:10:23] !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dns6001.wikimedia.org with OS buster [17:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:33] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host dns6001.wikimedia.org with OS buster executed with errors: - dns600... [17:12:15] (03CR) 10Dave Pifke: [C: 04-1] "Deleting that directory and running `puppet agent -t` did cause the CA bundle to be created." 
[puppet] - 10https://gerrit.wikimedia.org/r/737970 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [17:14:59] (03PS1) 10EllenR: Set up beta test environment for QuickSurvey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737971 (https://phabricator.wikimedia.org/T293798) [17:15:37] (03CR) 10EllenR: Set up beta test environment for QuickSurvey (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [17:20:39] 10SRE, 10serviceops, 10Continuous-Integration-Config, 10Release-Engineering-Team (CI & Testing services), 10Test-Coverage: Add pcov PHP extension to wikimedia apt so it can be used in Wikimedia CI - https://phabricator.wikimedia.org/T243847 (10hashar) [17:26:58] (03CR) 10EllenR: Set up beta test environment for QuickSurvey (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [17:29:39] (03PS2) 10JMeybohm: admin_ng: Add helmfile for cert-manager and cfssl-issuer [deployment-charts] - 10https://gerrit.wikimedia.org/r/737939 (https://phabricator.wikimedia.org/T294560) [17:29:41] (03PS1) 10JMeybohm: admin_ng: Fix templates being rendered as string [deployment-charts] - 10https://gerrit.wikimedia.org/r/737974 [17:29:43] (03PS1) 10JMeybohm: dmin_ng: Create Certificates for ingressgateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/737975 (https://phabricator.wikimedia.org/T295385) [17:29:49] (03PS2) 10JMeybohm: admin_ng: Create Certificates for ingressgateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/737975 (https://phabricator.wikimedia.org/T295385) [17:30:40] (03CR) 10Btullis: [C: 03+2] Update the times at which refine_sanitize monitor jobs are run (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/737650 (owner: 10Btullis) [17:45:34] (03CR) 10Ahmon Dancy: [C: 04-1] "I will find a different approach" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/737468 (https://phabricator.wikimedia.org/T295310) (owner: 10Ahmon Dancy) [17:45:43] (03Abandoned) 10Ahmon Dancy: CommonSettings.php: Only write to /tmp/mw-cache-* if running as www-data user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737468 (https://phabricator.wikimedia.org/T295310) (owner: 10Ahmon Dancy) [17:46:44] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] varnish: Mimick XFF behaviour with UDS + PROXY protocol [puppet] - 10https://gerrit.wikimedia.org/r/737755 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [17:57:43] 10SRE, 10Infrastructure-Foundations, 10Znuny, 10serviceops: upgrade/replace VRTS (formerly OTRS) buster to bullseye - https://phabricator.wikimedia.org/T295416 (10Dzahn) p:05Triage→03Medium Sounds great, thanks for that, Alex. [17:57:47] (03PS1) 10Ahmon Dancy: Get rid of obsolete train-versions.json file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737976 [17:58:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:58:22] jouncebot now [17:58:22] No deployments scheduled for the next 1 hour(s) and 1 minute(s) [17:58:35] (03CR) 10Ahmon Dancy: [C: 03+2] Get rid of obsolete train-versions.json file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737976 (owner: 10Ahmon Dancy) [17:59:23] (03Merged) 10jenkins-bot: Get rid of obsolete train-versions.json file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737976 (owner: 10Ahmon Dancy) [18:03:37] !log dancy@deploy1002 Started scap: Config: [[gerrit:737976|Get rid of obsolete train-versions.json file]] [18:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[18:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:21] (03CR) 10Vgutierrez: [C: 03+2] cache:haproxy: Enable hitless reloads [puppet] - 10https://gerrit.wikimedia.org/r/737902 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [18:04:34] (03PS3) 10Vgutierrez: cache:haproxy: Enable hitless reloads [puppet] - 10https://gerrit.wikimedia.org/r/737902 (https://phabricator.wikimedia.org/T290005) [18:07:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:01] !log restart haproxy on cp4026 and cp5006 to enable hitless reloads - T290005 [18:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:03] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [18:09:29] !log drmrs - rebooting a bunch of hosts to bios for further settings, please ignore any accidental alerts - they do *look* like they're alert-disabled) [18:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:53] user@cumin1001 Test message (volans) [18:16:00] that was me ^^^ [18:16:18] PROBLEM - haproxy alive on cp4026 is CRITICAL: CRITICAL check_alive invalid response https://wikitech.wikimedia.org/wiki/HAProxy [18:16:26] PROBLEM - haproxy alive on cp5006 is CRITICAL: CRITICAL check_alive invalid response https://wikitech.wikimedia.org/wiki/HAProxy [18:17:02] (03PS1) 10Dzahn: deployment-prep: remove scholarships app section [puppet] - 10https://gerrit.wikimedia.org/r/737977 (https://phabricator.wikimedia.org/T243037) [18:17:41] thanks for letting us know about the alerts, both, ack [18:18:23] ^^ that's me [18:18:26] * vgutierrez checking [18:19:31] heh,it's separate,ok [18:19:34] !log dancy@deploy1002 Finished scap: Config: [[gerrit:737976|Get rid of obsolete 
train-versions.json file]] (duration: 15m 57s) [18:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:36] hmmm icinga wants to access the haproxy stats socket without further privileges [18:20:37] :/ [18:21:22] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:22:58] (03CR) 10Majavah: [C: 03+1] "This doesn't appear to be deployed on beta, and I don't see a host list for (former) beta cluster hosts in the scap repo. No related VMs s" [puppet] - 10https://gerrit.wikimedia.org/r/737977 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [18:23:44] ACKNOWLEDGEMENT - haproxy alive on cp4026 is CRITICAL: CRITICAL check_alive invalid response Valentin Gutierrez permissions issue on /run/haproxy/haproxy.sock https://wikitech.wikimedia.org/wiki/HAProxy [18:23:44] ACKNOWLEDGEMENT - haproxy alive on cp5006 is CRITICAL: CRITICAL check_alive invalid response Valentin Gutierrez permissions issue on /run/haproxy/haproxy.sock https://wikitech.wikimedia.org/wiki/HAProxy [18:24:20] (03PS1) 10Dzahn: trafficserver: remove scholarships.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/737979 (https://phabricator.wikimedia.org/T243037) [18:25:13] (03PS2) 10Dzahn: trafficserver: remove scholarships.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/737979 (https://phabricator.wikimedia.org/T243037) [18:37:27] !log bblack@cumin1001 START - Cookbook sre.hosts.reimage for host dns6001.wikimedia.org with OS buster [18:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:34] !log bblack@cumin1001 START - Cookbook sre.hosts.reimage for host dns6002.wikimedia.org with OS buster [18:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:40] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - 
https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host dns6001.wikimedia.org with OS buster [18:37:48] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host dns6002.wikimedia.org with OS buster [18:38:59] (03PS1) 10Vgutierrez: cache:haproxy: Restore stats socket permissions [puppet] - 10https://gerrit.wikimedia.org/r/737981 (https://phabricator.wikimedia.org/T290005) [18:41:16] (03PS1) 10Dzahn: wikimania_scholarships: let the module start removing itself [puppet] - 10https://gerrit.wikimedia.org/r/737982 (https://phabricator.wikimedia.org/T243037) [18:41:34] (03CR) 10Vgutierrez: [C: 03+2] cache:haproxy: Restore stats socket permissions [puppet] - 10https://gerrit.wikimedia.org/r/737981 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [18:41:36] (03CR) 10Elukey: Move coal, navtiming and statsv to the new CA bundle (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/737970 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [18:42:19] (03CR) 10jerkins-bot: [V: 04-1] wikimania_scholarships: let the module start removing itself [puppet] - 10https://gerrit.wikimedia.org/r/737982 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [18:44:01] (03CR) 10Dzahn: "unless.. you would like to first add a redirect for it to send people somewhere else" [puppet] - 10https://gerrit.wikimedia.org/r/737979 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [18:45:22] 10SRE, 10serviceops, 10Continuous-Integration-Config, 10Release-Engineering-Team (CI & Testing services), 10Test-Coverage: Add pcov PHP extension to wikimedia apt so it can be used in Wikimedia CI - https://phabricator.wikimedia.org/T243847 (10Legoktm) a:03Legoktm Sure. 
[18:47:40] (03PS3) 10Elukey: Move coal, navtiming and statsv to the new CA bundle [puppet] - 10https://gerrit.wikimedia.org/r/737970 (https://phabricator.wikimedia.org/T291905) [18:47:42] (03PS1) 10Elukey: profile::base::certificates: vary trusted_certs on realm [puppet] - 10https://gerrit.wikimedia.org/r/737983 (https://phabricator.wikimedia.org/T291905) [18:48:15] 10SRE-tools, 10Infrastructure-Foundations, 10Observability-Metrics: wmflib.prometheus: add support for thanos backend - https://phabricator.wikimedia.org/T295498 (10Volans) p:05Triage→03Medium [18:48:51] (03CR) 10jerkins-bot: [V: 04-1] profile::base::certificates: vary trusted_certs on realm [puppet] - 10https://gerrit.wikimedia.org/r/737983 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [18:51:30] (03CR) 10Majavah: [C: 04-1] "this will not be suitable for deployment-prep: https://wikitech.wikimedia.org/wiki/PKI/Cloud" [puppet] - 10https://gerrit.wikimedia.org/r/737983 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [18:52:04] RECOVERY - haproxy alive on cp4026 is OK: OK check_alive uptime 415s https://wikitech.wikimedia.org/wiki/HAProxy [18:56:12] (03PS1) 10Vgutierrez: haproxy: Enable hitless reload [puppet] - 10https://gerrit.wikimedia.org/r/737984 [18:57:44] !log removing mediawiki font packages from parsoid hosts - T294378 [18:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:47] T294378: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378 [18:58:05] majavah: the change is still wip, I am trying to make it work for deployment-prep [18:58:34] RECOVERY - haproxy alive on cp5006 is OK: OK check_alive uptime 337s https://wikitech.wikimedia.org/wiki/HAProxy [19:00:05] Deploy window Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211110T1900) [19:00:05] RoanKattouw and Urbanecm: That opportune time is upon us again. 
Time for a UTC evening backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211110T1900). [19:00:05] mbsantos and cjming: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:32] (03CR) 10Dzahn: [V: 03+1 C: 03+2] parsoid: remove mediawiki font packages from all parsoid servers [puppet] - 10https://gerrit.wikimedia.org/r/737800 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [19:00:40] here! thanks [19:01:23] elukey: awesome, thanks! /me just making sure it doesn't get ignored and break things [19:01:42] majavah: nono don't worry :) [19:01:52] PROBLEM - Host fe80::185:15:58:5 is DOWN: CRITICAL - Destination Unreachable (fe80::185:15:58:5) [19:02:05] huh?? [19:02:09] fe80, heh [19:02:18] pretty sure it's work in the new DC though [19:03:01] ah, I just had a quick glance at it and thought it was on the cloud ranges but apparently not [19:03:05] (03PS1) 10Andrew Bogott: WMCS haproxy: set expose-fd listeners for all services [puppet] - 10https://gerrit.wikimedia.org/r/737986 [19:03:14] yea, that's dns6001, the v4 IP that is in there [19:03:40] it's installs in drmrs [19:05:19] (03CR) 10jerkins-bot: [V: 04-1] WMCS haproxy: set expose-fd listeners for all services [puppet] - 10https://gerrit.wikimedia.org/r/737986 (owner: 10Andrew Bogott) [19:06:08] PROBLEM - Host fe80::185:15:58:37 is DOWN: CRITICAL - Destination Unreachable (fe80::185:15:58:37) [19:06:21] (03PS2) 10Andrew Bogott: WMCS haproxy: set expose-fd listeners for all services [puppet] - 10https://gerrit.wikimedia.org/r/737986 [19:07:33] (03CR) 10Vgutierrez: "please note that this will require a service restart to be applied" [puppet] - 10https://gerrit.wikimedia.org/r/737986 (owner: 10Andrew Bogott) [19:08:35] (03PS2) 10Elukey: profile::base::certificates: vary trusted_certs on realm [puppet] - 
10https://gerrit.wikimedia.org/r/737983 (https://phabricator.wikimedia.org/T291905) [19:08:37] (03PS4) 10Elukey: Move coal, navtiming and statsv to the new CA bundle [puppet] - 10https://gerrit.wikimedia.org/r/737970 (https://phabricator.wikimedia.org/T291905) [19:09:22] (03PS1) 10EllenR: Set up beta test environment for QuickSurvey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737987 (https://phabricator.wikimedia.org/T293798) [19:10:50] PROBLEM - Recursive DNS on 185.15.58.5 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [19:11:06] PROBLEM - Recursive DNS on 185.15.58.37 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [19:11:06] RoanKattouw: or urbanecm: is the backport window still happening? [19:11:49] (03CR) 10Elukey: "Let's try to find a good compromise between flexibility and maintainability for deployment-prep :)" [puppet] - 10https://gerrit.wikimedia.org/r/737983 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [19:11:58] bblack: related to the reimages ^^^ [19:12:59] volans: yes, certainly [19:13:22] I think they got far enough to get service alerts configured via puppet, but they're not far enough for the services to be working yet [19:13:22] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:13:35] (and since these are by-ip for DNS stuff, they're not automatically just "part of the host's services") [19:14:01] maybe they should have a dependency in icinga to the host, if possible [19:14:15] but yeah they are not caught by the default downtime [19:14:28] ACKNOWLEDGEMENT - Recursive DNS on 185.15.58.37 is CRITICAL: DNS_QUERY CRITICAL - query timed out Brandon Black Temporary issue from imaging new DNS servers, not in production critical path anywhere https://wikitech.wikimedia.org/wiki/DNS [19:14:28] ACKNOWLEDGEMENT - Recursive DNS on 
185.15.58.5 is CRITICAL: DNS_QUERY CRITICAL - query timed out Brandon Black Temporary issue from imaging new DNS servers, not in production critical path anywhere https://wikitech.wikimedia.org/wiki/DNS [19:14:34] we could add a way to tell the script to downtime additional hosts/services, not sure if it's worth it [19:14:47] (03PS1) 10Volans: constants: add new drmrs datacenter [software/pywmflib] - 10https://gerrit.wikimedia.org/r/737989 (https://phabricator.wikimedia.org/T282787) [19:14:49] (03PS1) 10Volans: interactive: change input prefix to ==> [software/pywmflib] - 10https://gerrit.wikimedia.org/r/737990 [19:14:51] it would seem better to find ways to just attach them to hosts properly [19:14:51] (03PS1) 10Volans: docs: add examples to all modules [software/pywmflib] - 10https://gerrit.wikimedia.org/r/737991 [19:14:53] (03PS1) 10Volans: constants: add CORE_DATACENTERS constant [software/pywmflib] - 10https://gerrit.wikimedia.org/r/737992 [19:14:59] I think in this particular case, they can be, they just aren't [19:15:29] (they'd have to be attached to the host in icinga, but still have a custom IP for the check command to avoid dns problems testing dns) [19:15:42] (or something like that, I assume that's why they are the way they are) [19:17:02] I think make them their own virtual host and then tell Icinga the other real host is their parent [19:17:18] hmmm yeah, good point [19:17:46] PROBLEM - Host fe80::185:15:58:5 is DOWN: CRITICAL - Destination Unreachable (fe80::185:15:58:5) [19:19:42] heh [19:20:01] I'm guessing that's the temporary ipv6 link-local getting picked up by (icinga?) 
as the host's primary ipv6 [19:20:05] it's a strange output for sure :) [19:20:29] presumably it will fix itself after sufficient agent runs + reboots [19:20:34] yea, unusual to see :) but when I resolved the IPv4 inside that I could tell it was dns6 [19:20:45] cjming: hey, I'm here now if i can still help :) [19:20:59] But afaik you're a deployer, and more than welcomed to self-service :) [19:21:46] ah ok - thanks urbanecm: wasn't sure about protocols here [19:22:34] PROBLEM - Host fe80::185:15:58:37 is DOWN: CRITICAL - Destination Unreachable (fe80::185:15:58:37) [19:22:40] mbsantos: do you want to go first? [19:23:04] cjming: so i take it you'll take the window now :). I'm here and laptop-available if needed [19:23:10] !log uploaded php-pcov_1.0.6-4+wmf1~buster1_amd64.changes to apt.wm.o (T243847) [19:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:13] T243847: Add pcov PHP extension to wikimedia apt so it can be used in Wikimedia CI - https://phabricator.wikimedia.org/T243847 [19:23:26] 10SRE, 10serviceops, 10Continuous-Integration-Config, 10Release-Engineering-Team (CI & Testing services), 10Test-Coverage: Add pcov PHP extension to wikimedia apt so it can be used in Wikimedia CI - https://phabricator.wikimedia.org/T243847 (10Legoktm) 05Open→03Resolved [19:23:54] urbanecm: thanks! 
I can give it a go if you don't mind being on standby in case something goes haywire [19:24:02] Not at all :) [19:24:07] I would rather someone doing mine, I've only deployed configs once 🙃 [19:24:44] (03PS4) 10Clare Ming: kartographer: enable tegola in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722605 (https://phabricator.wikimedia.org/T291178) (owner: 10MSantos) [19:24:50] mbsantos: then i will give it a whirl and start with yours [19:25:08] (03CR) 10Clare Ming: [C: 03+2] kartographer: enable tegola in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722605 (https://phabricator.wikimedia.org/T291178) (owner: 10MSantos) [19:25:41] (03PS3) 10Ottomata: Add gitlab support for scap_source [puppet] - 10https://gerrit.wikimedia.org/r/737764 (https://phabricator.wikimedia.org/T295380) [19:26:54] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:26:55] cjming: thanks! [19:27:40] (03Merged) 10jenkins-bot: kartographer: enable tegola in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722605 (https://phabricator.wikimedia.org/T291178) (owner: 10MSantos) [19:27:58] cjming: note I won't likely follow the scrollback as i would when deploying, please ping me if you need me :)) [19:28:25] urbanecm: will do -- just to confirm mwdebug1002 is the correct test server right? 
[19:28:34] that, or mwdebug1001 [19:28:50] (03CR) 10Thcipriani: [C: 03+1] Add gitlab support for scap_source [puppet] - 10https://gerrit.wikimedia.org/r/737764 (https://phabricator.wikimedia.org/T295380) (owner: 10Ottomata) [19:28:59] both're equal [19:29:36] (03CR) 10Ottomata: [C: 03+2] Add gitlab support for scap_source [puppet] - 10https://gerrit.wikimedia.org/r/737764 (https://phabricator.wikimedia.org/T295380) (owner: 10Ottomata) [19:30:18] (03PS1) 10Volans: Adopt pathlib.Path [software/spicerack] - 10https://gerrit.wikimedia.org/r/737993 [19:30:32] mbsantos: can you check mwdebug1002? [19:31:40] (03PS2) 10Clare Ming: Lower mobile web click tracking rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737814 (https://phabricator.wikimedia.org/T295432) (owner: 10Jdlrobson) [19:32:32] sure, what do you want me to check there? scap log? cjming [19:33:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:11] mbsantos: just to confirm your change is looking good - then i'll sync [19:33:49] cjming lgtm [19:33:59] cool - then syncing live now [19:35:15] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:737814|Lower mobile web click tracking rate (T295432)]] (duration: 00m 57s) [19:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:26] T295432: Lower sampling rate for MobileWebUIClickTracking on English Wikipedia before wmf8 is on English Wikipedia - https://phabricator.wikimedia.org/T295432 [19:36:18] mbsantos: your change should be live now [19:36:21] (03CR) 10jerkins-bot: [V: 04-1] Adopt pathlib.Path [software/spicerack] - 10https://gerrit.wikimedia.org/r/737993 (owner: 10Volans) [19:36:28] cjming: all good from here! [19:36:29] Thanks! [19:36:34] np! 
[19:36:43] (03CR) 10Clare Ming: [C: 03+2] Lower mobile web click tracking rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737814 (https://phabricator.wikimedia.org/T295432) (owner: 10Jdlrobson) [19:36:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:33] (03Merged) 10jenkins-bot: Lower mobile web click tracking rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737814 (https://phabricator.wikimedia.org/T295432) (owner: 10Jdlrobson) [19:37:45] (03CR) 10Volans: "The mypy failure is due to the Depends-On patch that has been merged but not yet deployed. Once deployed to PyPI it will fix CI." [software/spicerack] - 10https://gerrit.wikimedia.org/r/737993 (owner: 10Volans) [19:41:20] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:737814|Lower mobile web click tracking rate (T295432)]] (duration: 00m 55s) [19:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:23] T295432: Lower sampling rate for MobileWebUIClickTracking on English Wikipedia before wmf8 is on English Wikipedia - https://phabricator.wikimedia.org/T295432 [19:42:16] RECOVERY - Recursive DNS on 185.15.58.5 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [19:42:22] RECOVERY - Recursive DNS on 185.15.58.37 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [19:42:24] urbanecm: thanks for standing by - all the patches are merged so I'm going to go ahead and close the backport window [19:42:37] excellent! Thanks for leading the window today :) [19:42:53] np + thank you! 
[19:42:57] !log end of UTC late backport & config window [19:42:58] any time [19:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:06] it's not UTC late though :) [19:43:16] oh - whoops [19:43:49] !log end of UTC evening backport & config window [19:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:06] (03CR) 10Dzahn: "It seems this broke deployment servers in cloud VPS. (T294174)" [puppet] - 10https://gerrit.wikimedia.org/r/723419 (owner: 10Giuseppe Lavagetto) [19:44:06] (y) [19:45:00] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:46:54] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:46:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:43] !log altering {eqiad,codfw}.maps.tiles_change to increase to 6 partitions in kafka main-eqiad, main-codfw and jumbo-eqiad: https://phabricator.wikimedia.org/T293366#7497076 [19:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:57] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns6002.wikimedia.org with OS buster [19:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:01] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns6001.wikimedia.org with OS buster [19:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:07] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host dns6002.wikimedia.org with OS buster completed: - dns6002 (**WARN**... [19:52:10] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host dns6001.wikimedia.org with OS buster completed: - dns6001 (**WARN**... 
[19:53:08] PROBLEM - Host 2a02:ec80:600:1:185:15:58:5 is DOWN: PING CRITICAL - Packet loss = 100% [19:53:08] PROBLEM - Host 2a02:ec80:600:2:185:15:58:37 is DOWN: PING CRITICAL - Packet loss = 100% [19:55:38] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:55:42] (03PS12) 10Ottomata: Declare airflow-dags scap for analytics instance [puppet] - 10https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) [19:56:18] (03CR) 10jerkins-bot: [V: 04-1] Declare airflow-dags scap for analytics instance [puppet] - 10https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) (owner: 10Ottomata) [19:57:34] (03PS1) 10Dzahn: cloud/devtools: sync Horizon Hiera with repo Hiera [puppet] - 10https://gerrit.wikimedia.org/r/737997 (https://phabricator.wikimedia.org/T294174) [19:59:01] (03CR) 10Dzahn: [C: 03+2] "let's avoid having conflicting values in web UI" [puppet] - 10https://gerrit.wikimedia.org/r/737997 (https://phabricator.wikimedia.org/T294174) (owner: 10Dzahn) [20:01:06] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:07:57] (03PS13) 10Ottomata: Declare airflow-dags scap for analytics instance [puppet] - 10https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) [20:08:34] (03CR) 10jerkins-bot: [V: 04-1] Declare airflow-dags scap for analytics instance [puppet] - 10https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) (owner: 10Ottomata) [20:10:47] (03PS14) 10Ottomata: Declare airflow-dags scap for analytics instance [puppet] - 10https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) [20:11:20] (03CR) 10jerkins-bot: [V: 04-1] Declare airflow-dags scap for analytics instance [puppet] - 10https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) (owner: 10Ottomata) [20:12:16] (03PS15) 10Ottomata: Declare airflow-dags scap for analytics instance [puppet] - 10https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) [20:15:04] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:15:06] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32353/console" [puppet] - 10https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) (owner: 10Ottomata) [20:16:50] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:17:31] (03PS1) 10BBlack: drmrs: configure dns servers for general use [puppet] - 10https://gerrit.wikimedia.org/r/738001 (https://phabricator.wikimedia.org/T282787) [20:18:58] (03CR) 10SBassett: [C: 03+1] gitlab: turn on Content-Security-Policy [puppet] - 
10https://gerrit.wikimedia.org/r/737968 (https://phabricator.wikimedia.org/T285363) (owner: 10Hashar) [20:19:17] (03CR) 10BBlack: [C: 03+2] drmrs: configure dns servers for general use [puppet] - 10https://gerrit.wikimedia.org/r/738001 (https://phabricator.wikimedia.org/T282787) (owner: 10BBlack) [20:21:06] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:23:55] (03PS1) 10Dzahn: cloud/devtools: fix puppet run on deploy-1002, add missing kafka/zookeeper keys [puppet] - 10https://gerrit.wikimedia.org/r/738002 (https://phabricator.wikimedia.org/T294174) [20:24:13] ^ checking the uncommitted DNS stuff [20:24:30] (03CR) 10Dzahn: [C: 03+2] cloud/devtools: fix puppet run on deploy-1002, add missing kafka/zookeeper keys [puppet] - 10https://gerrit.wikimedia.org/r/738002 (https://phabricator.wikimedia.org/T294174) (owner: 10Dzahn) [20:37:21] (03PS16) 10Ottomata: Declare airflow-dags scap for analytics instance [puppet] - 10https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) [20:37:53] (03CR) 10jerkins-bot: [V: 04-1] Declare airflow-dags scap for analytics instance [puppet] - 10https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) (owner: 10Ottomata) [20:39:00] (03PS17) 10Ottomata: Declare airflow-dags scap for analytics instance [puppet] - 10https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) [20:40:09] on a VM in cloud, ferm gets 'DNS query ..failed: SERVFAIL' but on the same machine I can use host, dig or ping and they all can resolve that.. wut [20:41:28] error message please? [20:41:51] oh.. 
i just found profile::resolving::domain_search: is set to eqiad.wmflabs there
[20:42:03] majavah: ferm[25717]: DNS query for 'prometheus01.metricsinfra.eqiad.wmflabs' failed: SERVFAIL for example
[20:42:25] let me adjust the resolving::domain_search to cloud instead of labs
[20:42:37] (PS18) Ottomata: Declare airflow-dags scap for analytics instance [puppet] - https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380)
[20:44:58] PROBLEM - Disk space on webperf2002 is CRITICAL: DISK CRITICAL - free space: /srv 11196 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=webperf2002&var-datasource=codfw+prometheus/ops
[20:45:21] -search devtools.eqiad.wmflabs eqiad.wmflabs codfw.wmflabs
[20:45:21] +search devtools.eqiad.wmflabs eqiad1.wikimedia.cloud
[20:45:28] (CR) Ottomata: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32355/console" [puppet] - https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) (owner: Ottomata)
[20:45:34] this helped, fixing that in Hiera in the repo now
[20:46:32] (PS19) Ottomata: Declare airflow-dags scap for analytics instance [puppet] - https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380)
[20:48:04] (PS1) Dzahn: cloud/devtools: fix resolv.conf search path (wmflabs->wikimedia.cloud) [puppet] - https://gerrit.wikimedia.org/r/738005 (https://phabricator.wikimedia.org/T294174)
[20:48:46] PROBLEM - Disk space on webperf1002 is CRITICAL: DISK CRITICAL - free space: /srv 11365 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=webperf1002&var-datasource=eqiad+prometheus/ops
[20:49:41] (PS2) Dzahn: cloud/devtools: fix resolv.conf search path (wmflabs->wikimedia.cloud) [puppet] -
https://gerrit.wikimedia.org/r/738005 (https://phabricator.wikimedia.org/T294174)
[20:50:19] (PS3) Dzahn: cloud/devtools: fix resolv.conf search path (wmflabs->wikimedia.cloud) [puppet] - https://gerrit.wikimedia.org/r/738005 (https://phabricator.wikimedia.org/T294174)
[20:50:43] (CR) Dzahn: [C: +2] cloud/devtools: fix resolv.conf search path (wmflabs->wikimedia.cloud) [puppet] - https://gerrit.wikimedia.org/r/738005 (https://phabricator.wikimedia.org/T294174) (owner: Dzahn)
[20:51:22] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:51:59] Amir1: Krinkle not sure who to ping about webperf1002 disk filling up
[20:52:07] it looks like lots of xenon (arclamp?) logs
[20:52:20] mutante: can't you just unset it on the project level to let it use cloud-wide defaults?
[20:52:28] stuff going back to 2020-06
[20:53:24] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:54:08] (PS20) Ottomata: Declare airflow-dags scap for analytics instance [puppet] - https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380)
[20:56:25] majavah: at least at the time of creating this I don't think so, but we can try if now
[20:57:22] SRE, Arc-Lamp, Performance-Team, serviceops: webperf*002 running out of disk space (arc lamp, xhgui) - https://phabricator.wikimedia.org/T235425 (Dzahn) Resolved→Open
[20:57:25] (CR) Ottomata: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32357/console" [puppet] - https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) (owner: Ottomata)
[20:57:31] ottomata: that should be dpifke , I reopened that task above
[20:57:52]
same thing, just webperfs ending in 2
[20:58:23] https://phabricator.wikimedia.org/T235425#5573454
[20:59:05] SRE, Arc-Lamp, Performance-Team, serviceops: webperf*002 running out of disk space (arc lamp, xhgui) - https://phabricator.wikimedia.org/T235425 (Dzahn) This is happening again. The webperf*2 hosts are alerting in Icinga about disk space.
[20:59:35] (PS21) Ottomata: Declare airflow-dags scap for analytics instance [puppet] - https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380)
[20:59:36] Looking now.
[21:00:05] chrisalbon and accraze: Time to snap out of that daydream and deploy Services – Graphoid / ORES. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211110T2100).
[21:00:41] (CR) Awight: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/737190 (https://phabricator.wikimedia.org/T223602) (owner: Awight)
[21:01:18] ottomata: the storing of those files is the server's primary purpose (not "log" files)
[21:01:26] ok
[21:01:29] glad i asked :)
[21:02:59] (CR) Ottomata: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32358/console" [puppet] - https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) (owner: Ottomata)
[21:03:39] SRE, Arc-Lamp, Performance-Team, serviceops: webperf*002 running out of disk space (arc lamp, xhgui) - https://phabricator.wikimedia.org/T235425 (Krinkle) a:Krinkle→None
[21:06:08] (CR) Awight: "Very minor thing: it looks like this patch got duplicated a bit during squashing.
See I8c38d73eef1c and I01f15ede1f, maybe these others s" [mediawiki-config] - https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: EllenR)
[21:09:44] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:11:12] SRE, Arc-Lamp, Performance-Team, serviceops: webperf*002 running out of disk space (arc lamp, xhgui) - https://phabricator.wikimedia.org/T235425 (dpifke) a:dpifke This is arguably a new issue, unrelated to the last time. I don't see anything obviously wrong with the jobs to compress and even...
[21:11:50] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:16:42] RECOVERY - Host 2a02:ec80:600:1:185:15:58:5 is UP: PING OK - Packet loss = 0%, RTA = 85.23 ms
[21:17:45] (PS1) Dave Pifke: arclamp: compress logs after 3 days, not 7 [puppet] - https://gerrit.wikimedia.org/r/738010 (https://phabricator.wikimedia.org/T235425)
[21:20:26] Took a first swing at the incident doc for today's ~2.5 hour partial cirrussearch outage: https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-10_cirrussearch_commonsfile_outage#Summary
[21:21:17] All commons file searches failed, as well as Special:Search for many wikis [but notably not English wikipedia], but Wikipedia searches that used the top right "go box" (how most users search for wiki articles) were not impacted
[21:23:21] Dylsss: ^ see above for incident report for today's cirrussearch issue.
thanks for making the initial ticket, as well as promptly notifying us in this channel
[21:24:04] !log asw1-b1[23]-drmrs: added ipv6 router-advertisement clauses, which work, but probably imperfectly :)
[21:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:24:10] RECOVERY - Host 2a02:ec80:600:2:185:15:58:37 is UP: PING OK - Packet loss = 0%, RTA = 85.24 ms
[21:26:16] (CR) Awight: Set up beta test environment for QuickSurvey (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: EllenR)
[21:27:06] PROBLEM - Disk space on webperf2002 is CRITICAL: DISK CRITICAL - free space: /srv 11014 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=webperf2002&var-datasource=codfw+prometheus/ops
[21:28:02] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[21:29:00] (PS5) Dzahn: snapshot: replace the word cron everywhere [puppet] - https://gerrit.wikimedia.org/r/736074
[21:29:20] (CR) jerkins-bot: [V: -1] snapshot: replace the word cron everywhere [puppet] - https://gerrit.wikimedia.org/r/736074 (owner: Dzahn)
[21:30:52] PROBLEM - Disk space on webperf1002 is CRITICAL: DISK CRITICAL - free space: /srv 11223 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=webperf1002&var-datasource=eqiad+prometheus/ops
[21:34:37] (CR) Dzahn: [C: +2] arclamp: compress logs after 3 days, not 7 [puppet] - https://gerrit.wikimedia.org/r/738010 (https://phabricator.wikimedia.org/T235425) (owner: Dave Pifke)
[21:35:15] SRE, Arc-Lamp, Performance-Team, serviceops, Patch-For-Review: webperf*002 running out of disk space (arc lamp, xhgui) -
https://phabricator.wikimedia.org/T235425 (Krinkle) From [Grafana: Host overview](https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=12&orgId=1&var-server=we...
[21:40:53] (PS22) Ottomata: Declare airflow-dags scap for analytics instance [puppet] - https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380)
[21:42:19] (PS1) Dzahn: webperf::arclamp: turn number of days until compressing logs into a parameter [puppet] - https://gerrit.wikimedia.org/r/738013 (https://phabricator.wikimedia.org/T235425)
[21:44:44] (CR) Ottomata: [V: +1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32359/console" [puppet] - https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) (owner: Ottomata)
[21:53:50] !log dns1001 - restart ntp.service to see if drmrs associations cleared up after dns changes, etc
[21:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:56:48] !log dns2001 - restart ntp.service to fix drmrs peering
[21:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:00:00] (PS2) Dave Pifke: webperf::arclamp: turn log compression & expiry into parameters [puppet] - https://gerrit.wikimedia.org/r/738013 (https://phabricator.wikimedia.org/T235425) (owner: Dzahn)
[22:00:41] (CR) jerkins-bot: [V: -1] webperf::arclamp: turn log compression & expiry into parameters [puppet] - https://gerrit.wikimedia.org/r/738013 (https://phabricator.wikimedia.org/T235425) (owner: Dzahn)
[22:01:14] !log dns1002 - restart ntp.servce to fix drmrs peering
[22:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:01:25] (CR) Dave Pifke: "Good catch! Strongly agree these should be more obviously tunable. I applied your idea to retention as well."
[puppet] - https://gerrit.wikimedia.org/r/738013 (https://phabricator.wikimedia.org/T235425) (owner: Dzahn)
[22:02:12] (CR) Ottomata: [V: +1] "Okay! Latest PCC looks good." [puppet] - https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) (owner: Ottomata)
[22:02:27] (PS1) Cwhite: nagios_common: add team-product-analytics contactgroup [puppet] - https://gerrit.wikimedia.org/r/738016 (https://phabricator.wikimedia.org/T295381)
[22:02:49] (PS3) Dave Pifke: webperf::arclamp: turn log compression & expiry into parameters [puppet] - https://gerrit.wikimedia.org/r/738013 (https://phabricator.wikimedia.org/T235425) (owner: Dzahn)
[22:06:31] (PS4) Dave Pifke: webperf::arclamp: turn log compression & expiry into parameters [puppet] - https://gerrit.wikimedia.org/r/738013 (https://phabricator.wikimedia.org/T235425) (owner: Dzahn)
[22:06:31] !log dns2002 - restart ntp.servce to fix drmrs peering
[22:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:07:20] (PS5) Dave Pifke: webperf::arclamp: turn log compression & expiry into parameters [puppet] - https://gerrit.wikimedia.org/r/738013 (https://phabricator.wikimedia.org/T235425) (owner: Dzahn)
[22:10:36] (CR) Cwhite: [C: +2] nagios_common: add team-product-analytics contactgroup [puppet] - https://gerrit.wikimedia.org/r/738016 (https://phabricator.wikimedia.org/T295381) (owner: Cwhite)
[22:16:40] (CR) Dave Pifke: [C: +1] "PCC output looks good: https://puppet-compiler.wmflabs.org/compiler1002/32362/" [puppet] - https://gerrit.wikimedia.org/r/738013 (https://phabricator.wikimedia.org/T235425) (owner: Dzahn)
[22:16:45] (PS4) Cwhite: statistics::product_analytics: Update contact group for monitoring [puppet] - https://gerrit.wikimedia.org/r/736916 (https://phabricator.wikimedia.org/T295381) (owner: Bearloga)
[22:17:10] (PS5) Cwhite: statistics::product_analytics: Update contact group for
monitoring [puppet] - https://gerrit.wikimedia.org/r/736916 (https://phabricator.wikimedia.org/T295381) (owner: Bearloga)
[22:25:13] (CR) Cwhite: [C: +2] statistics::product_analytics: Update contact group for monitoring (1 comment) [puppet] - https://gerrit.wikimedia.org/r/736916 (https://phabricator.wikimedia.org/T295381) (owner: Bearloga)
[22:26:20] (PS1) Bartosz Dziewoński: Configure upload dialog on officewiki to upload locally [mediawiki-config] - https://gerrit.wikimedia.org/r/738021 (https://phabricator.wikimedia.org/T295510)
[22:26:36] (CR) Awight: [C: +1] "Modernized! Looks safe to merge." [mediawiki-config] - https://gerrit.wikimedia.org/r/737859 (owner: Thiemo Kreuz (WMDE))
[22:28:59] RECOVERY - Disk space on webperf2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=webperf2002&var-datasource=codfw+prometheus/ops
[22:32:11] RECOVERY - Disk space on webperf1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=webperf1002&var-datasource=eqiad+prometheus/ops
[22:35:17] (CR) BryanDavis: [C: +1] trafficserver: remove scholarships.wikimedia.org (1 comment) [puppet] - https://gerrit.wikimedia.org/r/737979 (https://phabricator.wikimedia.org/T243037) (owner: Dzahn)
[22:37:48] (CR) BryanDavis: [C: +1] deployment-prep: remove scholarships app section [puppet] - https://gerrit.wikimedia.org/r/737977 (https://phabricator.wikimedia.org/T243037) (owner: Dzahn)
[22:58:23] PROBLEM - Check systemd state on webperf2002 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_compress_logs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:58:40] (CR) Dzahn: [C: +2] webperf::arclamp: turn log compression & expiry into parameters [puppet] - https://gerrit.wikimedia.org/r/738013
(https://phabricator.wikimedia.org/T235425) (owner: Dzahn)
[23:03:24] (CR) Awight: [C: +1] "Great!" [mediawiki-config] - https://gerrit.wikimedia.org/r/737857 (owner: Thiemo Kreuz (WMDE))
[23:05:20] (CR) Dzahn: "thank you! deployed and ran puppet on webperf* and nothing happened :)" [puppet] - https://gerrit.wikimedia.org/r/738013 (https://phabricator.wikimedia.org/T235425) (owner: Dzahn)
[23:06:15] (CR) Dzahn: [C: +2] deployment-prep: remove scholarships app section [puppet] - https://gerrit.wikimedia.org/r/737977 (https://phabricator.wikimedia.org/T243037) (owner: Dzahn)
[23:07:24] (CR) Awight: "I have a few patches open focusing on this class, but our goals were quite different. I suppose mine should be merged after yours since t" [mediawiki-config] - https://gerrit.wikimedia.org/r/737858 (owner: Thiemo Kreuz (WMDE))
[23:07:33] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1146.47 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:09:48] (CR) Dave Pifke: "Before:" [puppet] - https://gerrit.wikimedia.org/r/738010 (https://phabricator.wikimedia.org/T235425) (owner: Dave Pifke)
[23:11:23] SRE, Arc-Lamp, Performance-Team, serviceops: webperf*002 running out of disk space (arc lamp, xhgui) - https://phabricator.wikimedia.org/T235425 (dpifke) Open→Resolved Before reducing compression age threshold: ` Filesystem Size Used Avail Use% Mounted on /dev/vdb 295G 268G...
[23:11:48] (CR) Dzahn: arclamp: compress logs after 3 days, not 7 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/738010 (https://phabricator.wikimedia.org/T235425) (owner: Dave Pifke)
[23:13:10] (PS1) Dzahn: remove scholarships.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/738028 (https://phabricator.wikimedia.org/T243037)
[23:17:23] RECOVERY - Check systemd state on webperf2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:22:59] PROBLEM - MariaDB Replica IO: s6 on db2141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:23:03] (PS6) Dzahn: snapshot: replace the word cron everywhere [puppet] - https://gerrit.wikimedia.org/r/736074
[23:23:39] (CR) jerkins-bot: [V: -1] snapshot: replace the word cron everywhere [puppet] - https://gerrit.wikimedia.org/r/736074 (owner: Dzahn)
[23:23:51] (PS7) Dzahn: snapshot: replace the word cron everywhere [puppet] - https://gerrit.wikimedia.org/r/736074
[23:33:07] !log [urbanecm@mwmaint1002 ~]$ mwscript updateSpecialPages.php --wiki=foundationwiki --only=BrokenRedirects
[23:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:33:14] !log [urbanecm@mwmaint1002 ~]$ mwscript updateSpecialPages.php --wiki=foundationwiki --only=DoubleRedirects
[23:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:36:30] (PS1) Dzahn: mediawiki: remove font packages from API appservers [puppet] - https://gerrit.wikimedia.org/r/738031 (https://phabricator.wikimedia.org/T294378)
[23:38:47] PROBLEM - MariaDB Replica Lag: s6 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag:
1330.79 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:40:28] (CR) Dzahn: "fixed in devtools project (outside deployment-prep), there are not many projects with their own deployment_server I guess but it's a thing" [puppet] - https://gerrit.wikimedia.org/r/723419 (owner: Giuseppe Lavagetto)
[23:44:09] RECOVERY - MariaDB Replica IO: s6 on db2141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:45:09] RECOVERY - MariaDB Replica Lag: s6 on db2141 is OK: OK slave_sql_lag Replication lag: 1.67 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:46:55] !log start test backup/restore of 1tb commonswiki from relforge to swift in eqiad
[23:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:52:37] PROBLEM - MariaDB Replica IO: s6 on db2141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:54:45] RECOVERY - MariaDB Replica IO: s6 on db2141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica