[00:00:30] (03PS11) 10Jforrester: Add optimised square logo and wordmark for Wikimania on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704166 (https://phabricator.wikimedia.org/T286405) (owner: 10Juan90264) [00:01:22] (03CR) 10Jforrester: "I've squashed and re-written these moderately-aggressively, so they're now more readable and less horrendous in general. I think this is g" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704166 (https://phabricator.wikimedia.org/T286405) (owner: 10Juan90264) [00:02:47] (03PS1) 10Tim Starling: Faster mailing list construction, exclusion list [extensions/SecurePoll] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/713855 [00:03:02] (03CR) 10Tim Starling: [C: 03+2] Faster mailing list construction, exclusion list [extensions/SecurePoll] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/713855 (owner: 10Tim Starling) [00:03:53] (03Merged) 10jenkins-bot: Faster mailing list construction, exclusion list [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/713854 (owner: 10Tim Starling) [00:07:00] (03Merged) 10jenkins-bot: Faster mailing list construction, exclusion list [extensions/SecurePoll] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/713855 (owner: 10Tim Starling) [00:08:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:09:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[00:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:22] !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.19/extensions/SecurePoll/includes/User/LocalAuth.php: hack for mailout (duration: 00m 58s) [00:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:42] !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.19/extensions/SecurePoll/cli/wm-scripts/makeMailingList.php: code that uses said hack (duration: 00m 57s) [00:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:20:10] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Nathan Forrester - https://phabricator.wikimedia.org/T289259 (10thcipriani) >>! In T289259#7294999, @RobH wrote: > @thcipriani, > > This is one of three current requests to add a new wmf emp... [00:21:16] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Kate Levan - https://phabricator.wikimedia.org/T289258 (10thcipriani) >>! In T289258#7295002, @RobH wrote: > @thcipriani, > > This is one of three current requests to add a new wmf employee... [00:21:43] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Chmielko Maslak - https://phabricator.wikimedia.org/T289257 (10thcipriani) >>! In T289257#7295005, @RobH wrote: > @thcipriani, > > This is one of three current requests to add a new wmf empl... 
[01:19:06] 10SRE, 10Anti-Harassment, 10IP Info, 10serviceops: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (10wkandek) [01:29:32] 10SRE, 10Anti-Harassment, 10IP Info, 10serviceops: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (10wkandek) In T288375 we are discussing how this extension would have access to the maxmind databases when we migrate MediaWiki to Kubern... [01:34:49] 10SRE, 10Anti-Harassment, 10IP Info, 10serviceops: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (10Huji) My understanding is that the changes in the data are minimal from one version to the next; it is not like the ownership of hundre... [01:37:40] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:05:01] (03CR) 10RLazarus: [C: 03+1] "The approach LGTM pending Keith's comments on implementation -- thanks for doing this!" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/713440 (https://phabricator.wikimedia.org/T289036) (owner: 10Ema) [02:13:32] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Legoktm) I've only noticed one issue so far, Scores written in ABC are failing: {T289298}. I'll deploy the f... 
[02:18:46] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 76 probes of 617 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:24:40] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 41 probes of 617 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:49:57] (03PS1) 10Legoktm: removeTagline: Set explicit pcre.backtrack_limit [extensions/Score] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/713856 (https://phabricator.wikimedia.org/T289298) [02:50:09] (03CR) 10Legoktm: [C: 03+2] removeTagline: Set explicit pcre.backtrack_limit [extensions/Score] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/713856 (https://phabricator.wikimedia.org/T289298) (owner: 10Legoktm) [03:08:55] (03Merged) 10jenkins-bot: removeTagline: Set explicit pcre.backtrack_limit [extensions/Score] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/713856 (https://phabricator.wikimedia.org/T289298) (owner: 10Legoktm) [03:11:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [03:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:12:06] hmm, the X-Wikimedia-Debug browser extension is broken for me in the latest Firefox [03:12:18] it doesn't bring up the host dropdown [03:12:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[03:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:15:33] !log legoktm@deploy1002 Synchronized php-1.37.0-wmf.19/extensions/Score/scripts/removeTagline.php: removeTagline: Set explicit pcre.backtrack_limit (T289298) (duration: 00m 58s) [03:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:15:41] T289298: Scores written in ABC failing with "PCRE regular expression replacement failed" - https://phabricator.wikimedia.org/T289298 [03:34:55] (03PS4) 10Krinkle: Remove putenv() for GDFONTPATH [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664670 (https://phabricator.wikimedia.org/T274822) (owner: 10Legoktm) [03:35:00] (03CR) 10Krinkle: [C: 03+1] Remove putenv() for GDFONTPATH [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664670 (https://phabricator.wikimedia.org/T274822) (owner: 10Legoktm) [03:35:44] (03PS2) 10Krinkle: Drop $wmgUseScoreShellbox, redundant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713727 (owner: 10Legoktm) [03:35:47] (03CR) 10Krinkle: [C: 03+1] Drop $wmgUseScoreShellbox, redundant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713727 (owner: 10Legoktm) [03:54:50] (03CR) 10Juan90264: Add optimised square logo and wordmark for Wikimania on mobile (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704166 (https://phabricator.wikimedia.org/T286405) (owner: 10Juan90264) [04:49:46] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2056-production-search-omega-codfw on elastic2056 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-omega-codfw&var-instance=elastic2056&panelId=37 [05:42:52] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2056-production-search-omega-codfw on elastic2056 is OK: (C)100 gt (W)80 gt 77.29 
https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-omega-codfw&var-instance=elastic2056&panelId=37 [05:50:58] (03PS1) 10Jgiannelos: tegola-vector-tiles: Enable pregeneration job on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/713975 [05:59:08] (03PS2) 10Jgiannelos: tegola-vector-tiles: Enable pregeneration job on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/713975 (https://phabricator.wikimedia.org/T283159) [06:05:46] (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: Enable pregeneration job on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/713975 (https://phabricator.wikimedia.org/T283159) (owner: 10Jgiannelos) [06:07:46] !log sending election email to 44k people [06:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:19] (03Merged) 10jenkins-bot: tegola-vector-tiles: Enable pregeneration job on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/713975 (https://phabricator.wikimedia.org/T283159) (owner: 10Jgiannelos) [06:13:30] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [06:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:08] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-akhatun-singleuser.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:24:58] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:25:24] TimStarling: I got the email message, and my bot account did too - was that intended? [06:25:42] what is the bot username? 
[06:25:53] Legobot [06:26:51] via ceb.wikipedia apparently [06:27:12] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 93.40% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [06:27:17] no that is not intended [06:28:20] if more data points would be useful, I had one with Dexbot as well [06:29:30] no, probably failed for everyone [06:33:52] I didn't get one for MajavahBoy [06:34:02] s/Boy/Bot [06:36:40] yeah, I found the mistake [06:37:45] if there's a wiki where your bot is attached but it doesn't have bot permissions there, it will get mailed [06:38:04] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2056-production-search-omega-codfw on elastic2056 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-omega-codfw&var-instance=elastic2056&panelId=37 [06:38:24] because the feature I wrote to fix that was only written yesterday and was not properly tested [06:40:56] if your bot doesn't have bot permissions anywhere then it would have gotten mailed anyway -- bot means bot rights as far as qualifications go [06:41:23] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30724/console" [puppet] - 10https://gerrit.wikimedia.org/r/713815 (https://phabricator.wikimedia.org/T288815) (owner: 10Filippo Giunchedi) [06:42:29] anyway, please do not vote with your bot account [06:47:35] 10SRE, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10jijiki) [06:50:10] 10SRE, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in 
redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mc2019.codfw.wmnet', 'mc1037.eqiad.wmnet'... [06:51:06] (03CR) 10H.krishna123: "Great, it does work now. My turn to get the config files in order now 😄" [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) (owner: 10H.krishna123) [06:51:34] (03CR) 10H.krishna123: "recheck" [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) (owner: 10H.krishna123) [06:53:02] (03CR) 10H.krishna123: "Yep, I seem to be able to execute it." [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) (owner: 10H.krishna123) [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210820T0700) [07:01:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [07:06:34] PROBLEM - Aggregate IPsec Tunnel Status codfw on alert1001 is CRITICAL: instance=mc2024 site=codfw tunnel=mc1038_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [07:06:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [07:08:33] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1037.eqiad.wmnet with reason: REIMAGE [07:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:40] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2019.codfw.wmnet with reason: REIMAGE [07:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:29] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1038.eqiad.wmnet with reason: REIMAGE 
[07:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:46] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1037.eqiad.wmnet with reason: REIMAGE [07:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:58] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1038.eqiad.wmnet with reason: REIMAGE [07:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:40] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mc2019.codfw.wmnet with reason: REIMAGE [07:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:08] RECOVERY - Aggregate IPsec Tunnel Status codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [07:37:18] 10SRE, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc1037.eqiad.wmnet', 'mc2019.codfw.wmnet', 'mc1038.eqiad.wmnet'] ` and were **ALL** succ... [07:56:57] 10SRE, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10MW-1.37-notes (1.37.0-wmf.15; 2021-07-19), 10Patch-For-Review: Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (10kostajh) #release-engineering-team, could one of yo... 
[08:07:19] 10SRE, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10jijiki) [08:09:40] 10Puppet, 10Infrastructure-Foundations, 10puppet-compiler: puppet-facts-export sometimes fails with 'trusted' fact not found - https://phabricator.wikimedia.org/T289335 (10fgiunchedi) [08:22:48] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30727/console" [puppet] - 10https://gerrit.wikimedia.org/r/713815 (https://phabricator.wikimedia.org/T288815) (owner: 10Filippo Giunchedi) [08:32:42] (03PS1) 10Btullis: Begin decommission of druid1001 [puppet] - 10https://gerrit.wikimedia.org/r/714013 (https://phabricator.wikimedia.org/T255148) [08:32:48] (03CR) 10Alexandros Kosiaris: [C: 03+1] hieradata: add new memcached servers to mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/713875 (https://phabricator.wikimedia.org/T278225) (owner: 10Effie Mouzeli) [08:33:44] (03CR) 10Alexandros Kosiaris: [C: 03+2] url_downloader: Don't cache ICMP database [puppet] - 10https://gerrit.wikimedia.org/r/713843 (https://phabricator.wikimedia.org/T286525) (owner: 10Alexandros Kosiaris) [08:33:54] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2056-production-search-omega-codfw on elastic2056 is CRITICAL: 105.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-omega-codfw&var-instance=elastic2056&panelId=37 [08:36:14] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: add new memcached servers to mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/713875 (https://phabricator.wikimedia.org/T278225) (owner: 10Effie Mouzeli) [08:39:40] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2056-production-search-omega-codfw on elastic2056 
is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-omega-codfw&var-instance=elastic2056&panelId=37 [08:40:28] (03CR) 10Btullis: [C: 03+2] Begin decommission of druid1001 [puppet] - 10https://gerrit.wikimedia.org/r/714013 (https://phabricator.wikimedia.org/T255148) (owner: 10Btullis) [08:43:14] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on druid1001.eqiad.wmnet with reason: decommissioning druid1001 [08:43:16] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on druid1001.eqiad.wmnet with reason: decommissioning druid1001 [08:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:34] !log temp depool thanos-fe2003 to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/713815 [08:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:12] 10SRE, 10Patch-For-Review: urldownloader2002 running out of disk space in root partition - https://phabricator.wikimedia.org/T286525 (10akosiaris) 05Open→03Resolved a:03akosiaris I 've disabled netdb persistency. The database still exists but in memory only. It can be inspected using `curl http://localho... 
[08:45:55] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] swift: prefix Bullseye pipelines with proxy-logging [puppet] - 10https://gerrit.wikimedia.org/r/713815 (https://phabricator.wikimedia.org/T288815) (owner: 10Filippo Giunchedi) [08:48:31] !log roll depool/pool thanos-fe to apply swift change - T288815 [08:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:39] T288815: Envoy and swift HEAD with 204 response turns into 503 - https://phabricator.wikimedia.org/T288815 [08:58:59] 10SRE-swift-storage, 10envoy, 10serviceops, 10Patch-For-Review: Envoy and swift HEAD with 204 response turns into 503 - https://phabricator.wikimedia.org/T288815 (10fgiunchedi) I've deployed the fix from Swift upstream and it is working (i.e. Swift DTRT and Envoy's happy). @RLazarus I believe we're okay to... [09:01:29] 10SRE, 10Cloud-VPS, 10LDAP, 10cloud-services-team (Kanban): investigate slapd memory leak - https://phabricator.wikimedia.org/T130593 (10akosiaris) 05Open→03Resolved a:03akosiaris I am just gonna re-resolve this. We haven't worked on it in 4 years, it's clearly not a priority. Should we decided to vi... 
[09:03:02] RECOVERY - Check no envoy runtime configuration is left persistent on thanos-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [09:13:08] RECOVERY - Check no envoy runtime configuration is left persistent on thanos-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [09:13:35] (03PS1) 10Dzahn: miscweb: override docker registry URL with discovery name [deployment-charts] - 10https://gerrit.wikimedia.org/r/714014 (https://phabricator.wikimedia.org/T281538) [09:14:43] (03CR) 10Dzahn: "..just like the others ervices are doing it and" [deployment-charts] - 10https://gerrit.wikimedia.org/r/714014 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [09:15:42] (03CR) 10Dzahn: "without it there is an error that it can't find the image in the registry" [deployment-charts] - 10https://gerrit.wikimedia.org/r/714014 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [09:16:17] (03PS2) 10Dzahn: miscweb: override docker registry URL with discovery name [deployment-charts] - 10https://gerrit.wikimedia.org/r/714014 (https://phabricator.wikimedia.org/T281538) [09:19:12] RECOVERY - Check no envoy runtime configuration is left persistent on thanos-fe2001 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [09:21:10] (03CR) 10Dzahn: [C: 03+2] miscweb: override docker registry URL with discovery name [deployment-charts] - 10https://gerrit.wikimedia.org/r/714014 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [09:21:45] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [09:21:55] (03PS1) 10Btullis: Remove references to old druid servers from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/714015 
(https://phabricator.wikimedia.org/T255148) [09:23:18] !log btullis@cumin1001 START - Cookbook sre.hosts.decommission for hosts druid1001.eqiad.wmnet [09:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:42] (03Merged) 10jenkins-bot: miscweb: override docker registry URL with discovery name [deployment-charts] - 10https://gerrit.wikimedia.org/r/714014 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [09:25:09] (03CR) 10David Caro: [C: 03+1] "LGTM, related to task T280493" [puppet] - 10https://gerrit.wikimedia.org/r/713909 (owner: 10Bstorm) [09:25:37] RECOVERY - Check no envoy runtime configuration is left persistent on thanos-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [09:26:45] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [09:27:31] (03PS1) 10Effie Mouzeli: hieradata: add mc1040 and mc1042 to redis_sessions [puppet] - 10https://gerrit.wikimedia.org/r/714018 (https://phabricator.wikimedia.org/T278225) [09:32:04] !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . 
[09:32:06] (03PS2) 10Effie Mouzeli: hieradata: add mc1040 and mc1042 to redis_sessions [puppet] - 10https://gerrit.wikimedia.org/r/714018 (https://phabricator.wikimedia.org/T278225) [09:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:01] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts druid1001.eqiad.wmnet [09:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:09] (03CR) 10Btullis: [C: 03+2] Remove references to old druid servers from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/714015 (https://phabricator.wikimedia.org/T255148) (owner: 10Btullis) [09:39:28] (03CR) 10Dzahn: "They are also still in DHCP, if they are decom'ed they can also be removed from there (modules/install_server/files/dhcpd)" [puppet] - 10https://gerrit.wikimedia.org/r/714015 (https://phabricator.wikimedia.org/T255148) (owner: 10Btullis) [09:39:37] (03PS3) 10MSantos: maps: update imposm mapping [puppet] - 10https://gerrit.wikimedia.org/r/713827 (https://phabricator.wikimedia.org/T288400) [09:39:57] (03CR) 10Alexandros Kosiaris: [C: 03+1] hieradata: add mc1040 and mc1042 to redis_sessions [puppet] - 10https://gerrit.wikimedia.org/r/714018 (https://phabricator.wikimedia.org/T278225) (owner: 10Effie Mouzeli) [09:41:22] (03CR) 10Btullis: Remove references to old druid servers from site.pp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714015 (https://phabricator.wikimedia.org/T255148) (owner: 10Btullis) [09:43:18] (03PS1) 10MSantos: maps: prepare for OSM re-import [puppet] - 10https://gerrit.wikimedia.org/r/714020 [09:43:27] (03PS1) 10Filippo Giunchedi: alertmanager: hide 'alertname' label [puppet] - 10https://gerrit.wikimedia.org/r/714021 (https://phabricator.wikimedia.org/T284213) [09:43:30] (03PS1) 10Dzahn: miscweb: override image, version name and set some CPU/RAM limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/714022 
(https://phabricator.wikimedia.org/T255148) [09:43:41] (03CR) 10jerkins-bot: [V: 04-1] maps: prepare for OSM re-import [puppet] - 10https://gerrit.wikimedia.org/r/714020 (owner: 10MSantos) [09:43:43] (03CR) 10jerkins-bot: [V: 04-1] miscweb: override image, version name and set some CPU/RAM limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/714022 (https://phabricator.wikimedia.org/T255148) (owner: 10Dzahn) [09:44:00] (03PS2) 10Dzahn: miscweb: override image, version name and set some CPU/RAM limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/714022 (https://phabricator.wikimedia.org/T255148) [09:44:25] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: hide 'alertname' label [puppet] - 10https://gerrit.wikimedia.org/r/714021 (https://phabricator.wikimedia.org/T284213) (owner: 10Filippo Giunchedi) [09:46:22] (03PS4) 10MSantos: maps: update imposm mapping [puppet] - 10https://gerrit.wikimedia.org/r/713827 (https://phabricator.wikimedia.org/T288400) [09:46:24] (03PS2) 10MSantos: maps: prepare for OSM re-import [puppet] - 10https://gerrit.wikimedia.org/r/714020 [09:46:37] RECOVERY - Check systemd state on maps2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:48:31] (03PS1) 10Btullis: Remove dummy keytabs for decommissioned druid servers [labs/private] - 10https://gerrit.wikimedia.org/r/714023 (https://phabricator.wikimedia.org/T255148) [09:52:25] (03CR) 10Jelto: [C: 03+1] "lgtm, just added two small comments" [deployment-charts] - 10https://gerrit.wikimedia.org/r/714022 (https://phabricator.wikimedia.org/T255148) (owner: 10Dzahn) [09:55:14] (03PS1) 10Btullis: Remove a reference to druid1001 from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/714024 (https://phabricator.wikimedia.org/T255148) [09:55:51] (03CR) 10Dzahn: [C: 03+1] "ah, just one of them, I thought all :)" [puppet] - 10https://gerrit.wikimedia.org/r/714024 (https://phabricator.wikimedia.org/T255148) 
(owner: 10Btullis) [09:56:00] (03CR) 10Btullis: "Thanks for spotting this Dzahn" [puppet] - 10https://gerrit.wikimedia.org/r/714024 (https://phabricator.wikimedia.org/T255148) (owner: 10Btullis) [09:56:05] (03PS3) 10Effie Mouzeli: hieradata: add mc1040 and mc1042 to redis_sessions [puppet] - 10https://gerrit.wikimedia.org/r/714018 (https://phabricator.wikimedia.org/T278225) [09:59:12] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: add mc1040 and mc1042 to redis_sessions [puppet] - 10https://gerrit.wikimedia.org/r/714018 (https://phabricator.wikimedia.org/T278225) (owner: 10Effie Mouzeli) [09:59:27] (03PS3) 10Dzahn: miscweb: override image, version name and set some CPU/RAM limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/714022 (https://phabricator.wikimedia.org/T255148) [09:59:34] (03CR) 10jerkins-bot: [V: 04-1] miscweb: override image, version name and set some CPU/RAM limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/714022 (https://phabricator.wikimedia.org/T255148) (owner: 10Dzahn) [10:00:21] (03CR) 10Dzahn: miscweb: override image, version name and set some CPU/RAM limits (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/714022 (https://phabricator.wikimedia.org/T255148) (owner: 10Dzahn) [10:00:28] (03PS4) 10Dzahn: miscweb: override image, version name and set some CPU/RAM limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/714022 (https://phabricator.wikimedia.org/T255148) [10:03:37] (03CR) 10Jelto: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/714022 (https://phabricator.wikimedia.org/T255148) (owner: 10Dzahn) [10:09:34] (03CR) 10Dzahn: [C: 03+2] miscweb: override image, version name and set some CPU/RAM limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/714022 (https://phabricator.wikimedia.org/T255148) (owner: 10Dzahn) [10:11:24] PROBLEM - Aggregate IPsec Tunnel Status eqiad on alert1001 is CRITICAL: instance={mc1020,mc1022} site=eqiad tunnel={mc2020_v4,mc2022_v4} 
https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [10:12:01] (03Merged) 10jenkins-bot: miscweb: override image, version name and set some CPU/RAM limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/714022 (https://phabricator.wikimedia.org/T255148) (owner: 10Dzahn) [10:13:38] ^ that is me [10:14:32] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [10:15:07] (03CR) 10Jgiannelos: [C: 03+1] maps: update imposm mapping [puppet] - 10https://gerrit.wikimedia.org/r/713827 (https://phabricator.wikimedia.org/T288400) (owner: 10MSantos) [10:15:54] (03PS1) 10Filippo Giunchedi: pontoon: add swift settings to swift stack [puppet] - 10https://gerrit.wikimedia.org/r/714025 [10:15:56] (03PS1) 10Filippo Giunchedi: pontoon: add update swift stack rolemap [puppet] - 10https://gerrit.wikimedia.org/r/714026 [10:15:58] (03PS1) 10Filippo Giunchedi: pontoon: add thanos settings to swift stack [puppet] - 10https://gerrit.wikimedia.org/r/714027 [10:16:00] (03CR) 10Jgiannelos: [C: 03+1] "nit: Just for git commit sanity maybe you want to rebase and update the git commit message since the admin boundaries have already been me" [puppet] - 10https://gerrit.wikimedia.org/r/713827 (https://phabricator.wikimedia.org/T288400) (owner: 10MSantos) [10:18:06] checking high latency in eqiad (!) 
[10:22:16] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30732/console" [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [10:28:24] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2056-production-search-omega-codfw on elastic2056 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-omega-codfw&var-instance=elastic2056&panelId=37 [10:29:37] (03CR) 10Dzahn: "Hi! The cumin aliases file has an entry "maps-canary: P{maps1001.eqiad.wmnet}" and we get automatic emails telling us when an alias doesn'" [puppet] - 10https://gerrit.wikimedia.org/r/713903 (https://phabricator.wikimedia.org/T288810) (owner: 10Hnowlan) [10:30:02] (03PS1) 10Zabe: Revert "Enable NewUserMessage on hiwiktionary" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713860 (https://phabricator.wikimedia.org/T287091) [10:30:10] (03PS2) 10Zabe: Revert "Enable NewUserMessage on hiwiktionary" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713860 (https://phabricator.wikimedia.org/T287091) [10:30:59] (03CR) 10Dzahn: "also the decom'ed maps hosts are still in DHCP (modules/install_server/files/dhcpd/linux-host-entries.ttyS1-115200) and in this (which I k" [puppet] - 10https://gerrit.wikimedia.org/r/713903 (https://phabricator.wikimedia.org/T288810) (owner: 10Hnowlan) [10:31:20] (03PS1) 10Hnowlan: cumin: fix maps canary alias [puppet] - 10https://gerrit.wikimedia.org/r/714029 (https://phabricator.wikimedia.org/T288810) [10:32:07] (03CR) 10Dzahn: [C: 03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/714029 (https://phabricator.wikimedia.org/T288810) (owner: 10Hnowlan) [10:32:33] (03PS2) 10Hnowlan: cumin: fix maps canary alias + cleanup dhcp entries [puppet] - 10https://gerrit.wikimedia.org/r/714029 (https://phabricator.wikimedia.org/T288810) [10:33:04] (03CR) 10Dzahn: [C: 03+1] cumin: fix maps canary alias + cleanup dhcp entries [puppet] - 10https://gerrit.wikimedia.org/r/714029 (https://phabricator.wikimedia.org/T288810) (owner: 10Hnowlan) [10:33:07] (03CR) 10Hnowlan: site: remove decommissioned maps hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713903 (https://phabricator.wikimedia.org/T288810) (owner: 10Hnowlan) [10:33:23] (03PS2) 10Hnowlan: site: remove decommissioned maps hosts [puppet] - 10https://gerrit.wikimedia.org/r/713903 (https://phabricator.wikimedia.org/T288810) [10:33:35] (03CR) 10Dzahn: site: remove decommissioned maps hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713903 (https://phabricator.wikimedia.org/T288810) (owner: 10Hnowlan) [10:33:38] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [10:34:39] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] kubernetes/staging: Enable Priority admission plugin in staging [puppet] - 10https://gerrit.wikimedia.org/r/713806 (https://phabricator.wikimedia.org/T289131) (owner: 10JMeybohm) [10:35:08] !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . [10:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:40] RECOVERY - Aggregate IPsec Tunnel Status eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [10:40:53] (03CR) 10Hnowlan: [C: 03+2] cumin: fix maps canary alias + cleanup dhcp entries [puppet] - 10https://gerrit.wikimedia.org/r/714029 (https://phabricator.wikimedia.org/T288810) (owner: 10Hnowlan) [10:46:31] (03PS1) 10Effie Mouzeli: hieradata: add mc1044, mc1048 and mc1050 to redis_sessions [puppet] - 10https://gerrit.wikimedia.org/r/714030 (https://phabricator.wikimedia.org/T278225) [10:47:25] (03PS1) 10Joal: Update dumps geoeditors readme back to normal [puppet] - 10https://gerrit.wikimedia.org/r/714031 [10:47:57] (03PS5) 10MSantos: maps: update imposm mapping [puppet] - 10https://gerrit.wikimedia.org/r/713827 (https://phabricator.wikimedia.org/T288400) [10:50:14] (03CR) 10MSantos: maps: update imposm mapping (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713827 (https://phabricator.wikimedia.org/T288400) (owner: 10MSantos) [10:50:34] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30733/console" [puppet] - 10https://gerrit.wikimedia.org/r/713903 (https://phabricator.wikimedia.org/T288810) (owner: 10Hnowlan) [10:51:35] (03CR) 10Btullis: [C: 03+2] Update dumps geoeditors readme back to normal [puppet] - 10https://gerrit.wikimedia.org/r/714031 (owner: 10Joal) [10:52:26] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30734/console" [puppet] - 10https://gerrit.wikimedia.org/r/713903 (https://phabricator.wikimedia.org/T288810) (owner: 10Hnowlan) [10:53:01] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] site: remove decommissioned maps hosts [puppet] - 10https://gerrit.wikimedia.org/r/713903 (https://phabricator.wikimedia.org/T288810) (owner: 10Hnowlan) [10:54:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=eqiad 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:54:18] (03PS1) 10Effie Mouzeli: hieradata: add mc1052, mc1054 to redis_sessions [puppet] - 10https://gerrit.wikimedia.org/r/714032 (https://phabricator.wikimedia.org/T278225) [10:55:19] (03PS1) 10Dzahn: miscweb: set service.deployment to production, not minikube, and port [deployment-charts] - 10https://gerrit.wikimedia.org/r/714034 (https://phabricator.wikimedia.org/T255148) [10:55:38] (03PS2) 10Dzahn: miscweb: set service.deployment to production, not minikube, and port [deployment-charts] - 10https://gerrit.wikimedia.org/r/714034 (https://phabricator.wikimedia.org/T255148) [10:55:56] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:57:27] (03CR) 10jerkins-bot: [V: 04-1] miscweb: set service.deployment to production, not minikube, and port [deployment-charts] - 10https://gerrit.wikimedia.org/r/714034 (https://phabricator.wikimedia.org/T255148) (owner: 10Dzahn) [11:03:03] (03PS1) 10Jcrespo: Add dummy recovery access identities and keys for read only recoveries [labs/private] - 10https://gerrit.wikimedia.org/r/714037 (https://phabricator.wikimedia.org/T276442) [11:03:41] (03PS2) 10Jcrespo: mediabaackup:Add dummy recovery identity keys for read only recoveries [labs/private] - 10https://gerrit.wikimedia.org/r/714037 (https://phabricator.wikimedia.org/T276442) [11:04:42] (03PS3) 10Jcrespo: mediabackup:Add dummy recovery identity keys for read only recoveries [labs/private] - 10https://gerrit.wikimedia.org/r/714037 (https://phabricator.wikimedia.org/T276442) [11:05:12] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] mediabackup:Add dummy recovery identity keys for read only recoveries [labs/private] - 10https://gerrit.wikimedia.org/r/714037 
(https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [11:14:41] (03PS1) 10JMeybohm: admin_ng: Deploy a ResourceQuota to allow priority pods in kube-system [deployment-charts] - 10https://gerrit.wikimedia.org/r/714038 (https://phabricator.wikimedia.org/T289131) [11:14:57] (03CR) 10Hnowlan: [C: 03+2] maps: prepare for OSM re-import [puppet] - 10https://gerrit.wikimedia.org/r/714020 (owner: 10MSantos) [11:15:02] (03CR) 10Hnowlan: [C: 03+2] maps: update imposm mapping [puppet] - 10https://gerrit.wikimedia.org/r/713827 (https://phabricator.wikimedia.org/T288400) (owner: 10MSantos) [11:18:07] (03PS2) 10JMeybohm: admin_ng: Deploy a ResourceQuota to allow priority pods in kube-system [deployment-charts] - 10https://gerrit.wikimedia.org/r/714038 (https://phabricator.wikimedia.org/T289131) [11:19:29] (03CR) 10Alexandros Kosiaris: [C: 03+1] hieradata: add mc1044, mc1048 and mc1050 to redis_sessions [puppet] - 10https://gerrit.wikimedia.org/r/714030 (https://phabricator.wikimedia.org/T278225) (owner: 10Effie Mouzeli) [11:20:59] (03PS4) 10Vgutierrez: envoyproxy: Remove trailing whitespace [puppet] - 10https://gerrit.wikimedia.org/r/710494 (https://phabricator.wikimedia.org/T265880) [11:21:01] (03PS4) 10Vgutierrez: envoyproxy: Support V3 configuration API [puppet] - 10https://gerrit.wikimedia.org/r/710495 (https://phabricator.wikimedia.org/T265880) [11:21:03] (03PS5) 10Vgutierrez: envoyproxy: Add prefetched OCSP staple support [puppet] - 10https://gerrit.wikimedia.org/r/710496 (https://phabricator.wikimedia.org/T271421) [11:21:05] (03PS7) 10Vgutierrez: envoyproxy: Add dual stack cert support [puppet] - 10https://gerrit.wikimedia.org/r/710507 (https://phabricator.wikimedia.org/T271421) [11:21:07] (03PS6) 10Vgutierrez: envoyproxy: Support ciphersuite configuration [puppet] - 10https://gerrit.wikimedia.org/r/710577 (https://phabricator.wikimedia.org/T271421) [11:21:09] (03PS5) 10Vgutierrez: envoyproxy: Support ECDH curves configuration [puppet] - 
10https://gerrit.wikimedia.org/r/710581 (https://phabricator.wikimedia.org/T271421) [11:21:11] (03PS5) 10Vgutierrez: envoyproxy: Add upstream PROXY protocol support [puppet] - 10https://gerrit.wikimedia.org/r/711386 (https://phabricator.wikimedia.org/T271421) [11:21:13] (03PS5) 10Vgutierrez: envoyproxy: Add STEK configuration support [puppet] - 10https://gerrit.wikimedia.org/r/711399 (https://phabricator.wikimedia.org/T271421) [11:21:15] (03PS5) 10Vgutierrez: cache: Provide an envoy STEK manager script [puppet] - 10https://gerrit.wikimedia.org/r/711407 (https://phabricator.wikimedia.org/T271421) [11:21:17] (03PS6) 10Vgutierrez: envoyproxy: Provide support for UDS upstreams [puppet] - 10https://gerrit.wikimedia.org/r/712368 (https://phabricator.wikimedia.org/T271421) [11:21:19] (03PS7) 10Vgutierrez: envoyproxy: Support alpn_protocols configuration [puppet] - 10https://gerrit.wikimedia.org/r/713238 (https://phabricator.wikimedia.org/T271421) [11:21:21] (03PS7) 10Vgutierrez: envoyproxy: Support TLS min/max version config [puppet] - 10https://gerrit.wikimedia.org/r/713246 (https://phabricator.wikimedia.org/T271421) [11:21:23] (03PS6) 10Vgutierrez: envoyproxy: Allow setting a global lua script [puppet] - 10https://gerrit.wikimedia.org/r/713271 (https://phabricator.wikimedia.org/T271421) [11:21:25] (03PS6) 10Vgutierrez: cache: Use envoy lua API to provide TLS info [puppet] - 10https://gerrit.wikimedia.org/r/713272 (https://phabricator.wikimedia.org/T271421) [11:21:27] (03PS6) 10Vgutierrez: envoyproxy: Support PreserveCase HeaderKeyFormat [puppet] - 10https://gerrit.wikimedia.org/r/713460 (https://phabricator.wikimedia.org/T271421) [11:21:29] (03PS1) 10Vgutierrez: envoyproxy: Allow configuring TLS handshake timeout [puppet] - 10https://gerrit.wikimedia.org/r/714039 (https://phabricator.wikimedia.org/T271421) [11:21:41] sorry about that [11:21:44] * vgutierrez hides [11:21:59] * Lucas_WMDE glances at zuul [11:22:03] oh good [11:23:15] PROBLEM - Check systemd state on 
maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service,wmf_auto_restart_cassandra-metrics-collector.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:23:37] I'm keeping the updates to the minimum and running utils/run_ci_locally.sh first [11:23:39] (03PS1) 10Jcrespo: mediabackup: Add new conf and identity for read only recovery [puppet] - 10https://gerrit.wikimedia.org/r/714040 (https://phabricator.wikimedia.org/T276442) [11:25:57] (03CR) 10Jcrespo: [C: 03+1] "Looks good: https://puppet-compiler.wmflabs.org/compiler1001/30735/ms-backup1001.eqiad.wmnet/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/714040 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [11:26:22] ACKNOWLEDGEMENT - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service,wmf_auto_restart_cassandra-metrics-collector.timer Hnowlan Disabled via puppet for reimport. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:30:22] (03PS2) 10Jcrespo: mediabackup: Add new conf and identity for read only recovery [puppet] - 10https://gerrit.wikimedia.org/r/714040 (https://phabricator.wikimedia.org/T276442) [11:33:29] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [11:36:14] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: add mc1044, mc1048 and mc1050 to redis_sessions [puppet] - 10https://gerrit.wikimedia.org/r/714030 (https://phabricator.wikimedia.org/T278225) (owner: 10Effie Mouzeli) [11:36:27] (03PS2) 10Effie Mouzeli: hieradata: add mc1044, mc1048 and mc1050 to redis_sessions [puppet] - 10https://gerrit.wikimedia.org/r/714030 (https://phabricator.wikimedia.org/T278225) [11:37:05] is jenkins voting for you on your 
puppet patches? [11:38:15] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 24.02 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [11:38:37] oh, there is a very large puppet queue on CI [11:41:04] (03PS1) 10Jgiannelos: tegola-vector-tiles: Make staging cron run every 5 mins for debug purposes [deployment-charts] - 10https://gerrit.wikimedia.org/r/714041 [11:44:01] (03CR) 10Jcrespo: [C: 03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/714040 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [11:45:10] ah, it got unstuck, not sure if on its own, or someone did something [11:45:34] (03CR) 10Jcrespo: [C: 03+2] mediabackup: Add new conf and identity for read only recovery [puppet] - 10https://gerrit.wikimedia.org/r/714040 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [11:47:37] (03CR) 10David Caro: [C: 03+1] wikireplicas: remove old code for supporting monolithic replicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713495 (owner: 10Bstorm) [11:48:27] RECOVERY - Maps tiles generation on alert1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [11:49:48] (03CR) 10Effie Mouzeli: [C: 03+1] tegola-vector-tiles: Make staging cron run every 5 mins for debug purposes [deployment-charts] - 10https://gerrit.wikimedia.org/r/714041 (owner: 10Jgiannelos) [11:50:35] (03CR) 10Jgiannelos: [C: 03+2] tegola-vector-tiles: Make staging cron run every 5 mins for debug purposes [deployment-charts] - 10https://gerrit.wikimedia.org/r/714041 (owner: 10Jgiannelos) [11:53:05] (03Merged) 10jenkins-bot: tegola-vector-tiles: Make staging cron run every 5 mins for debug purposes [deployment-charts] - 10https://gerrit.wikimedia.org/r/714041 (owner: 10Jgiannelos) [11:55:12] !log jgiannelos@deploy1002 
helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [11:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:48] !log enabled priority admission plugin on k8s staging, rolling restart all pods in kube-system namespace - T289131 [12:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:56] T289131: Enable the Priority admission plugin - https://phabricator.wikimedia.org/T289131 [12:01:07] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:01:19] (03PS3) 10Dzahn: miscweb: set service.deployment to production, not minikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/714034 (https://phabricator.wikimedia.org/T255148) [12:03:17] PROBLEM - Check systemd state on mc1044 is CRITICAL: CRITICAL - degraded: The following units failed: redis-instance-tcp_6379.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:03:29] PROBLEM - Check systemd state on mc1050 is CRITICAL: CRITICAL - degraded: The following units failed: redis-instance-tcp_6379.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:03:36] ^ me [12:07:41] RECOVERY - Check systemd state on mc1044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:07:55] RECOVERY - Check systemd state on mc1050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:11:18] 10SRE, 10serviceops: Run httpbb periodically - 
https://phabricator.wikimedia.org/T289202 (10Dzahn) > What individual hosts should we test? "All tests on each appserver" is probably more work than we need to do. We probably don't want to pick a random host every time (the behavior should be consistent, but if i... [12:15:43] (03PS1) 10Jgiannelos: tegola-vector-tiles: Add missing labels on cron pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/714044 (https://phabricator.wikimedia.org/T283159) [12:21:46] (03CR) 10Jgiannelos: "This was discovered when cronjob pods raised connection errors to other services. The network policies didn't match the pod selector and w" [deployment-charts] - 10https://gerrit.wikimedia.org/r/714044 (https://phabricator.wikimedia.org/T283159) (owner: 10Jgiannelos) [12:24:21] (03PS2) 10Effie Mouzeli: hieradata: add mc1052, mc1054 to redis_sessions [puppet] - 10https://gerrit.wikimedia.org/r/714032 (https://phabricator.wikimedia.org/T278225) [12:25:52] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: add mc1052, mc1054 to redis_sessions [puppet] - 10https://gerrit.wikimedia.org/r/714032 (https://phabricator.wikimedia.org/T278225) (owner: 10Effie Mouzeli) [12:26:11] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:31:19] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:40:50] ^ looking [12:45:13] ACKNOWLEDGEMENT - Check systemd state on mc1054 is CRITICAL: CRITICAL - degraded: The following units failed: redis-instance-tcp_6379.service Effie Mouzeli will fix itself soon https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:46:33] PROBLEM - Aggregate IPsec Tunnel Status eqiad on alert1001 is CRITICAL: instance={mc1034,mc1036} site=eqiad tunnel={mc2034_v4,mc2036_v4} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [12:49:31] RECOVERY - Aggregate IPsec Tunnel Status eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [12:50:11] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [12:55:31] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [13:11:08] 10SRE, 10Wikimedia-Mailing-lists, 10Chinese-Sites: Create new Mailing List PRCWikimen - https://phabricator.wikimedia.org/T287083 (10Aklapper) 05Stalled→03Declined p:05Medium→03Triage Unfortunately closing this Phabricator task as no further information has been provided. @LanmeiCN: After you have pr... [13:14:09] PROBLEM - Check systemd state on mc1019 is CRITICAL: CRITICAL - degraded: The following units failed: redis-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:15:48] ACKNOWLEDGEMENT - Check systemd state on mc1019 is CRITICAL: CRITICAL - degraded: The following units failed: redis-server.service Effie Mouzeli ignoring, server will be retired https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:21:45] (03PS1) 10Effie Mouzeli: ProductionServices: replace redis_lock eqiad servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714049 (https://phabricator.wikimedia.org/T280582) [13:27:42] (03CR) 10Vgutierrez: envoyproxy: Support ECDH curves configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/710581 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [13:32:56] (03PS1) 10Effie Mouzeli: hieradata/hosts.pp: eqiad memcached refresh cleanup [puppet] - 10https://gerrit.wikimedia.org/r/714050 (https://phabricator.wikimedia.org/T278225) [13:33:29] (03CR) 10jerkins-bot: [V: 04-1] hieradata/hosts.pp: eqiad memcached refresh cleanup [puppet] - 10https://gerrit.wikimedia.org/r/714050 (https://phabricator.wikimedia.org/T278225) (owner: 10Effie Mouzeli) [13:34:00] (03PS2) 10Effie Mouzeli: hieradata/hosts.pp: eqiad memcached refresh cleanup [puppet] - 10https://gerrit.wikimedia.org/r/714050 
(https://phabricator.wikimedia.org/T278225) [13:34:53] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:36:17] (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: Add missing labels on cron pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/714044 (https://phabricator.wikimedia.org/T283159) (owner: 10Jgiannelos) [13:38:37] (03CR) 10Majavah: Fix webservice failing with error trying to raise exception (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/713938 (https://phabricator.wikimedia.org/T289177) (owner: 10Nskaggs) [13:38:56] (03Merged) 10jenkins-bot: tegola-vector-tiles: Add missing labels on cron pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/714044 (https://phabricator.wikimedia.org/T283159) (owner: 10Jgiannelos) [13:39:17] (03CR) 10Nskaggs: [C: 03+1] "I like the idea of creating a task without paging for non-wake me up alerts. 
Could we route this to show up for the person on clinic duty " [puppet] - 10https://gerrit.wikimedia.org/r/713909 (owner: 10Bstorm) [13:40:37] RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:41:33] (03PS1) 10Jgiannelos: tegola-vector-tiles: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/714052 [13:41:44] (03PS1) 10Dzahn: miscweb: define a dedicated nodePort, 4111 [deployment-charts] - 10https://gerrit.wikimedia.org/r/714053 (https://phabricator.wikimedia.org/T255148) [13:42:15] (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/714052 (owner: 10Jgiannelos) [13:42:43] (03PS5) 10Btullis: Improve the Kerberos automatic renewal service [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) [13:43:47] (03PS2) 10Dzahn: miscweb: define a dedicated nodePort, 4111 [deployment-charts] - 10https://gerrit.wikimedia.org/r/714053 (https://phabricator.wikimedia.org/T255148) [13:44:42] (03Merged) 10jenkins-bot: tegola-vector-tiles: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/714052 (owner: 10Jgiannelos) [13:45:46] (03PS3) 10Dzahn: miscweb: define a dedicated nodePort, 4111 [deployment-charts] - 10https://gerrit.wikimedia.org/r/714053 (https://phabricator.wikimedia.org/T255148) [13:46:02] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30741/console" [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [13:46:55] (03CR) 10David Caro: [C: 03+1] Fix webservice failing with error trying to raise exception (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/713938 (https://phabricator.wikimedia.org/T289177) (owner: 10Nskaggs) 
[13:47:15] (03CR) 10Jelto: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/714053 (https://phabricator.wikimedia.org/T255148) (owner: 10Dzahn) [13:47:44] (03CR) 10Dzahn: [C: 03+2] miscweb: define a dedicated nodePort, 4111 [deployment-charts] - 10https://gerrit.wikimedia.org/r/714053 (https://phabricator.wikimedia.org/T255148) (owner: 10Dzahn) [13:48:33] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [13:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:10] (03Merged) 10jenkins-bot: miscweb: define a dedicated nodePort, 4111 [deployment-charts] - 10https://gerrit.wikimedia.org/r/714053 (https://phabricator.wikimedia.org/T255148) (owner: 10Dzahn) [13:54:02] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . [13:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:00] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10puppet-compiler, and 2 others: OKR: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10Lucas_Werkmeister_WMDE) >>! In T265138#6532545, @jbond wrote: >> will move to puppet6 untill at least bullseye > Its worth noting that b... 
[14:00:47] (03CR) 10Nskaggs: Fix webservice failing with error trying to raise exception (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/713938 (https://phabricator.wikimedia.org/T289177) (owner: 10Nskaggs) [14:03:48] (03PS2) 10Nskaggs: Fix webservice failing with error trying to raise exception [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/713938 (https://phabricator.wikimedia.org/T289177) [14:05:40] (03PS3) 10Nskaggs: Fix webservice failing with error trying to raise exception [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/713938 (https://phabricator.wikimedia.org/T289177) [14:06:41] (03CR) 10Ema: varnish: Containerize varnish test environment (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: 10MMandere) [14:10:00] (03PS1) 10Jgiannelos: Revert "tegola-vector-tiles: Make staging cron run every 5 mins for debug purposes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/713862 [14:16:59] (03CR) 10Ema: varnish: Containerize varnish test environment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: 10MMandere) [14:19:35] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mc2021.codfw.wmnet'... [14:27:14] (03CR) 10Btullis: "My PCC runs for this patch are producing an odd result for me." 
[puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [14:36:55] (03CR) 10David Caro: [C: 03+1] openstack: stop paging for systemd alone (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713909 (owner: 10Bstorm) [14:39:41] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2021.codfw.wmnet with reason: REIMAGE [14:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:56] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2021.codfw.wmnet with reason: REIMAGE [14:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:11] (03CR) 10Krinkle: [C: 03+1] ProductionServices: replace redis_lock eqiad servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714049 (https://phabricator.wikimedia.org/T280582) (owner: 10Effie Mouzeli) [14:51:44] (03CR) 10David Caro: [C: 03+1] openstack: stop paging for systemd alone (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713909 (owner: 10Bstorm) [14:52:33] (03CR) 10Effie Mouzeli: [C: 03+2] Revert "tegola-vector-tiles: Make staging cron run every 5 mins for debug purposes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/713862 (owner: 10Jgiannelos) [14:53:22] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2021.codfw.wmnet'] ` and were **ALL** successful. [14:55:19] (03Merged) 10jenkins-bot: Revert "tegola-vector-tiles: Make staging cron run every 5 mins for debug purposes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/713862 (owner: 10Jgiannelos) [15:00:05] jouncebot now [15:00:05] For the next 15 hour(s) and 59 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210820T0700) [15:05:15] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10jijiki) [15:05:56] (03PS2) 10Effie Mouzeli: ProductionServices: replace redis_lock eqiad servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714049 (https://phabricator.wikimedia.org/T278225) [15:10:58] (03CR) 10jerkins-bot: [V: 04-1] simplelamp2: support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/713863 (owner: 10RhinosF1) [15:14:05] (03PS3) 10RhinosF1: simplelamp2: support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/713863 [15:14:34] (03CR) 10jerkins-bot: [V: 04-1] simplelamp2: support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/713863 (owner: 10RhinosF1) [15:14:36] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [15:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:38] (03PS4) 10RhinosF1: simplelamp2: support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/713863 [15:17:57] 10SRE-swift-storage, 10envoy, 10serviceops, 10Patch-For-Review: Envoy and swift HEAD with 204 response turns into 503 - https://phabricator.wikimedia.org/T288815 (10RLazarus) 05Open→03Resolved Sounds good to me! That means the $runtime field is unused anywhere, but I think it's a useful knob to have, s... [15:17:58] (03CR) 10RLazarus: "Abandoning per T288815 as the Swift fix from upstream worked." 
[puppet] - https://gerrit.wikimedia.org/r/713725 (https://phabricator.wikimedia.org/T288815) (owner: RLazarus)
[15:18:03] (Abandoned) RLazarus: thanos::frontend: Disable Envoy's strict 204 header parsing [puppet] - https://gerrit.wikimedia.org/r/713725 (https://phabricator.wikimedia.org/T288815) (owner: RLazarus)
[15:23:49] !log btullis@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1129.eqiad.wmnet
[15:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:40] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1129.eqiad.wmnet
[15:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:19] !log deleting various pods from staging to have them recreated with priorities - T289131
[15:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:28] T289131: Enable the Priority admission plugin - https://phabricator.wikimedia.org/T289131
[15:52:05] (PS1) Filippo Giunchedi: icinga: remove reading-web Grafana checks [puppet] - https://gerrit.wikimedia.org/r/714067 (https://phabricator.wikimedia.org/T281359)
[16:06:10] (CR) Phuedx: [C: +1] "Thanks!"
[puppet] - https://gerrit.wikimedia.org/r/714067 (https://phabricator.wikimedia.org/T281359) (owner: Filippo Giunchedi)
[16:10:40] (PS2) Bstorm: openstack: stop paging for systemd alone [puppet] - https://gerrit.wikimedia.org/r/713909 (https://phabricator.wikimedia.org/T280493)
[16:16:12] (PS1) Jforrester: [WIP] deployment-prep: Add wikifunctions.beta.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/714068 (https://phabricator.wikimedia.org/T284162)
[16:26:21] (PS2) JMeybohm: kubernetes: Enable Priority admission plugin [puppet] - https://gerrit.wikimedia.org/r/713807 (https://phabricator.wikimedia.org/T289131)
[16:26:23] (PS1) JMeybohm: k8s::apiserver: Add admission controller config file [puppet] - https://gerrit.wikimedia.org/r/714071 (https://phabricator.wikimedia.org/T289131)
[16:28:32] (PS2) JMeybohm: k8s::apiserver: Add admission controller config file [puppet] - https://gerrit.wikimedia.org/r/714071 (https://phabricator.wikimedia.org/T289131)
[16:36:15] !log btullis@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1133.eqiad.wmnet
[16:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:05] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1133.eqiad.wmnet
[16:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:32] SRE, SRE-Access-Requests, Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Kate Levan - https://phabricator.wikimedia.org/T289258 (RobH) a:odimitrijevic >>! In T289258#7295002, @RobH wrote: > @odimitrijevic, > > This is one of three current requests to add a...
[16:38:55] SRE, SRE-Access-Requests, Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Chmielko Maslak - https://phabricator.wikimedia.org/T289257 (RobH) a:odimitrijevic >>!
In T289257#7295005, @RobH wrote: > @odimitrijevic, > > This is one of three current requests to a...
[16:39:25] (CR) Bstorm: openstack: stop paging for systemd alone (1 comment) [puppet] - https://gerrit.wikimedia.org/r/713909 (https://phabricator.wikimedia.org/T280493) (owner: Bstorm)
[16:39:34] SRE, SRE-Access-Requests, Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Nathan Forrester - https://phabricator.wikimedia.org/T289259 (RobH) a:odimitrijevic >>! In T289259#7294999, @RobH wrote: > @odimitrijevic, > > This is one of three current requests to...
[16:39:45] SRE, SRE-Access-Requests, Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Nathan Forrester - https://phabricator.wikimedia.org/T289259 (RobH)
[16:39:59] SRE, SRE-Access-Requests, Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Nathan Forrester - https://phabricator.wikimedia.org/T289259 (RobH)
[16:40:06] (CR) Bstorm: [C: +2] openstack: stop paging for systemd alone [puppet] - https://gerrit.wikimedia.org/r/713909 (https://phabricator.wikimedia.org/T280493) (owner: Bstorm)
[16:40:24] SRE, SRE-Access-Requests, Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Kate Levan - https://phabricator.wikimedia.org/T289258 (RobH)
[16:40:33] SRE, SRE-Access-Requests, Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Chmielko Maslak - https://phabricator.wikimedia.org/T289257 (RobH)
[16:40:56] SRE, SRE-Access-Requests, Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Kate Levan - https://phabricator.wikimedia.org/T289258 (RobH)
[16:41:15] SRE, SRE-Access-Requests, Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Chmielko Maslak - https://phabricator.wikimedia.org/T289257 (RobH)
[16:43:19]
!log btullis@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1134.eqiad.wmnet
[16:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:45:12] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1134.eqiad.wmnet
[16:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:01] (PS1) Jcrespo: puppet: Document deprecation of require_packages() on README [puppet] - https://gerrit.wikimedia.org/r/714079 (https://phabricator.wikimedia.org/T266479)
[16:48:59] (CR) Jcrespo: "This seems to be the consensus." [puppet] - https://gerrit.wikimedia.org/r/714079 (https://phabricator.wikimedia.org/T266479) (owner: Jcrespo)
[16:50:40] (PS2) Jcrespo: puppet: Document deprecation of require_packages() on README [puppet] - https://gerrit.wikimedia.org/r/714079 (https://phabricator.wikimedia.org/T266479)
[16:54:22] !log btullis@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1139.eqiad.wmnet
[16:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:56:14] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1139.eqiad.wmnet
[16:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:56:51] !log btullis@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1140.eqiad.wmnet
[16:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:58:42] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1140.eqiad.wmnet
[16:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:01:36] !log btullis@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1141.eqiad.wmnet
[17:01:42] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:02:22] (PS1) Bstorm: openstack: correct the hiera entry for the systemd alert disablement [puppet] - https://gerrit.wikimedia.org/r/714087 (https://phabricator.wikimedia.org/T280493)
[17:03:22] (CR) Bstorm: [C: +2] openstack: correct the hiera entry for the systemd alert disablement [puppet] - https://gerrit.wikimedia.org/r/714087 (https://phabricator.wikimedia.org/T280493) (owner: Bstorm)
[17:03:28] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1141.eqiad.wmnet
[17:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:40] (CR) Majavah: [C: +2] Fix webservice failing with error trying to raise exception [software/tools-webservice] - https://gerrit.wikimedia.org/r/713938 (https://phabricator.wikimedia.org/T289177) (owner: Nskaggs)
[17:05:18] (Merged) jenkins-bot: Fix webservice failing with error trying to raise exception [software/tools-webservice] - https://gerrit.wikimedia.org/r/713938 (https://phabricator.wikimedia.org/T289177) (owner: Nskaggs)
[17:08:50] SRE-swift-storage, User-fgiunchedi: Put ms-be20[62-65] in service - https://phabricator.wikimedia.org/T288458 (Legoktm) >>! In T288458#7274937, @fgiunchedi wrote: > Hosts are ready to go now, though with swift traffic fully on codfw I don't think we should rebalance now. I see at least two options: > >...
[17:09:58] (CR) Bstorm: "The different releases are running the same version of gridengine in Debian.
There won't be separate grids, just different queues, so this" [software/tools-manifest] - https://gerrit.wikimedia.org/r/713663 (https://phabricator.wikimedia.org/T278748) (owner: Majavah)
[17:11:10] (CR) Bstorm: [C: -1] Take OS codename into account for grid compatibility [software/tools-manifest] - https://gerrit.wikimedia.org/r/713663 (https://phabricator.wikimedia.org/T278748) (owner: Majavah)
[17:12:49] (CR) Bstorm: [C: +2] "Ok, I have an opportunity to test this live with I9d9d3d2409e4d2e7fff58a49938b6b971f5, so I think I'll try merging now. I wouldn't want th" [puppet] - https://gerrit.wikimedia.org/r/713495 (owner: Bstorm)
[17:17:33] (PS4) Bstorm: Add growthexperiments to allowed logtypes [puppet] - https://gerrit.wikimedia.org/r/636436 (https://phabricator.wikimedia.org/T266477) (owner: Urbanecm)
[17:27:03] (CR) MVernon: [C: +1] "Thanks 😊" [puppet] - https://gerrit.wikimedia.org/r/714079 (https://phabricator.wikimedia.org/T266479) (owner: Jcrespo)
[17:30:12] (CR) Bstorm: [C: +2] Add growthexperiments to allowed logtypes [puppet] - https://gerrit.wikimedia.org/r/636436 (https://phabricator.wikimedia.org/T266477) (owner: Urbanecm)
[17:30:27] \o/
[17:55:17] SRE, serviceops: Run httpbb periodically - https://phabricator.wikimedia.org/T289202 (RLazarus) Sounds reasonable! I'll probably hardcode a canary host at first, then we can look at choosing one automatically.
[18:37:51] (PS1) Majavah: php74-sssd: Use Composer 2 and add unzip [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/714105
[18:39:39] (CR) Legoktm: [C: +1] "Switching to Composer 2 now makes sense in the long run, and while it's technically a breaking change there doesn't appear to be widesprea" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/714105 (owner: Majavah)
[18:40:10] (CR) Majavah: [C: +2] php74-sssd: Use Composer 2 and add unzip [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/714105 (owner: Majavah)
[18:40:47] (Merged) jenkins-bot: php74-sssd: Use Composer 2 and add unzip [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/714105 (owner: Majavah)
[18:55:56] (PS1) PipelineBot: shellbox: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/714107
[18:59:22] (PS1) Majavah: node12-sssd: Use npm 7 from debian packages [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/714108
[19:01:33] (CR) Legoktm: [C: +1] "Same basic rationale as switching to Composer 2."
[docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/714108 (owner: Majavah)
[19:01:43] (CR) Majavah: [C: +2] node12-sssd: Use npm 7 from debian packages [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/714108 (owner: Majavah)
[19:02:17] (Merged) jenkins-bot: node12-sssd: Use npm 7 from debian packages [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/714108 (owner: Majavah)
[19:32:17] (PS1) PipelineBot: shellbox: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/714112
[19:32:23] (PS1) PipelineBot: shellbox-constraints: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/714113
[19:36:33] beautiful
[19:45:15] (PS1) PipelineBot: shellbox-constraints: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/714118
[19:47:36] (PS1) PipelineBot: shellbox: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/714119
[19:51:35] (PS1) QChris: Add .gitreview [software/durum] - https://gerrit.wikimedia.org/r/714120
[19:51:37] (CR) QChris: [V: +2 C: +2] Add .gitreview [software/durum] - https://gerrit.wikimedia.org/r/714120 (owner: QChris)
[20:27:47] (Abandoned) Legoktm: shellbox: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/714107 (owner: PipelineBot)
[20:27:50] (Abandoned) Legoktm: shellbox: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/714112 (owner: PipelineBot)
[20:27:55] (Abandoned) Legoktm: shellbox-constraints: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/714118 (owner: PipelineBot)
[20:28:14] (Abandoned) Legoktm: shellbox: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/714119 (owner: PipelineBot)
[20:28:17] (Abandoned) Legoktm: shellbox-constraints: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/714113
(owner: PipelineBot)
[21:11:52] (PS1) PipelineBot: shellbox-constraints: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/714129
[21:14:15] (PS1) PipelineBot: shellbox: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/714130
[21:35:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:41:24] (CR) Urbanecm: [C: +1] "i'll try to get this live on Monday" [mediawiki-config] - https://gerrit.wikimedia.org/r/713860 (https://phabricator.wikimedia.org/T287091) (owner: Zabe)
[21:41:59] SRE, Services, Patch-For-Review, Service-deployment-requests: New Service Request Shellbox - https://phabricator.wikimedia.org/T281423 (Legoktm)
[21:43:59] SRE, MW-on-K8s, Shellbox, serviceops, and 3 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (Legoktm)
[21:44:17] SRE, Services, Patch-For-Review, Service-deployment-requests: New Service Request Shellbox - https://phabricator.wikimedia.org/T281423 (Legoktm) Open→Resolved a:Legoktm
[21:45:32] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:48:18] SRE, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (Legoktm)
[22:02:39] Can I request a quick favor from someone with shell access? Want to make sure my fix for T289384 doesn't break use on WMF wikis. Can someone run
[22:02:39] `wfParseUrl( WikiMap::getWiki( 'enwiki' )->getCanonicalServer() )` and let me know the result?
[22:02:39] T289384: SpecialGlobalWatchlistSettings site validation fails on Vagrant - https://phabricator.wikimedia.org/T289384
[22:07:15] sure
[22:08:18] DannyS712: https://phabricator.wikimedia.org/P17057
[22:08:39] thanks (just wanted to make sure port wasn't magically set there)
[22:10:29] (PS1) RLazarus: httpbb: Add hourly test runs via systemd timers. [puppet] - https://gerrit.wikimedia.org/r/714136 (https://phabricator.wikimedia.org/T289202)
[22:10:31] (PS1) RLazarus: hieradata: Run httpbb hourly from cumin2001 against a codfw appserver [puppet] - https://gerrit.wikimedia.org/r/714137 (https://phabricator.wikimedia.org/T289202)
[22:12:33] SRE, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (Legoktm) Open→Resolved Everything looks good, going to call this resolved. If you run into any issue...
[22:19:40] SRE, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (Nachtbold) Many thanks for your efforts! I'm glad we're having this feature running again.
[23:17:58] !log deployed patch for T289385
[23:18:03] (PS1) Urbanecm: [labs] enwiki: Enable mentorship for 10% of users only [mediawiki-config] - https://gerrit.wikimedia.org/r/714144 (https://phabricator.wikimedia.org/T287903)
[23:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log