[00:00:58] (03CR) 10Dzahn: "Hashar asked back in 2019 that we should _ban_ using ensure=>latest but here we do for releng tools. https://phabricator.wikimedia.org/T2" [puppet] - 10https://gerrit.wikimedia.org/r/925033 (https://phabricator.wikimedia.org/T290260) (owner: 10Reedy) [00:01:00] RECOVERY - Check systemd state on releases2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:08:26] (03CR) 10Dzahn: "Currently the 2 _new_ VMs that are not production yet.. and have never cloned this repo before.. are getting an error. The existing buster" [puppet] - 10https://gerrit.wikimedia.org/r/925033 (https://phabricator.wikimedia.org/T290260) (owner: 10Reedy) [00:08:41] (03PS1) 10BryanDavis: python: Replace --mount with --wsgi-file [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/925097 (https://phabricator.wikimedia.org/T337897) [00:08:46] PROBLEM - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases2002.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:10:18] RECOVERY - Check systemd state on releases2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:10:33] ^ flapping thing here .. we need to fix it.. but we will soon [00:10:43] just some race [00:11:01] Reedy: which follow up button? 
[00:12:00] There's a Follow-up in the three vertical dots menu next to revert [00:12:10] TIL [00:12:23] (03PS1) 10BryanDavis: python: Replace --mount with --wsgi-file in webservice-runner [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/925099 (https://phabricator.wikimedia.org/T337897) [00:12:56] https://gerritcodereview-test.gsrc.io/user-inline-edit.html [00:13:05] it's been there in one form or another for a while :D [00:13:10] (03CR) 10Dzahn: [C: 03+2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/925033" [puppet] - 10https://gerrit.wikimedia.org/r/879908 (https://phabricator.wikimedia.org/T290260) (owner: 10Jeena Huneidi) [00:13:38] (03PS4) 10Dzahn: releases: clone repos/releng/release from gitlab [puppet] - 10https://gerrit.wikimedia.org/r/925033 (https://phabricator.wikimedia.org/T290260) (owner: 10Reedy) [00:15:17] (03CR) 10Dzahn: "status quo:" [puppet] - 10https://gerrit.wikimedia.org/r/925033 (https://phabricator.wikimedia.org/T290260) (owner: 10Reedy) [00:18:04] PROBLEM - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases2002.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:20:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:21:10] RECOVERY - Check systemd state on releases2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:21:34] (03PS1) 10Dzahn: Revert "releases: Ensure rsync jobs get removed on the non-active machine" [puppet] - 10https://gerrit.wikimedia.org/r/925037 [00:23:22] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:24:00] (03CR) 10Dzahn: [C: 03+2] Revert "releases: Ensure rsync jobs get removed on the non-active machine" [puppet] - 10https://gerrit.wikimedia.org/r/925037 (owner: 10Dzahn) [00:26:28] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:26:53] (03CR) 10Dzahn: [C: 03+2] "the revert added files /etc/ferm/conf.d/10_rsyncd_access_srv-org-wikimedia-releases-releases2002.codfw.wmnet* back on releases1002." [puppet] - 10https://gerrit.wikimedia.org/r/925037 (owner: 10Dzahn) [00:27:58] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:34:02] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:39:20] PROBLEM - Disk space on krb1001 is CRITICAL: DISK CRITICAL - free space: / 1766 MB (3% inode=97%): /tmp 1766 MB (3% inode=97%): /var/tmp 1766 MB (3% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=krb1001&var-datasource=eqiad+prometheus/ops [00:39:33] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/925107 [00:39:39] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/925107 (owner: 10TrainBranchBot) [00:58:03] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/925107 (owner: 10TrainBranchBot) [01:23:32] (03CR) 10Ottomata: "Do you think we should merge and apply this, or wait for 
review on https://github.com/apache/flink-kubernetes-operator/pull/604 ?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/925016 (https://phabricator.wikimedia.org/T336185) (owner: 10TChin) [01:53:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs2021:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [02:06:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:26:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:36:00] (03PS1) 10Marostegui: db1155: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/925286 (https://phabricator.wikimedia.org/T337446) [04:37:19] (03CR) 10Marostegui: [C: 03+2] db1155: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/925286 (https://phabricator.wikimedia.org/T337446) (owner: 10Marostegui) [04:41:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [04:45:11] (03PS2) 10KartikMistry: MinT: Update to 2023-06-01-041041-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/924915 (https://phabricator.wikimedia.org/T336525) [04:45:54] (03PS1) 10Tim Starling: Remove runphpscriptletonallwikis.py 
[puppet] - 10https://gerrit.wikimedia.org/r/925295 [04:47:34] (03PS1) 10KartikMistry: Update cxserver to 2023-06-01-041016-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/925298 (https://phabricator.wikimedia.org/T337669) [05:11:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:19:21] * kart_ going to update MinT and cxserver [05:21:13] (03PS3) 10KartikMistry: MinT: Update to 2023-06-01-041041-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/924915 (https://phabricator.wikimedia.org/T336525) [05:21:27] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2023-06-01-041016-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/925298 (https://phabricator.wikimedia.org/T337669) (owner: 10KartikMistry) [05:22:13] (03Merged) 10jenkins-bot: Update cxserver to 2023-06-01-041016-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/925298 (https://phabricator.wikimedia.org/T337669) (owner: 10KartikMistry) [05:26:51] PROBLEM - Check systemd state on krb1001 is CRITICAL: CRITICAL - degraded: The following units failed: confd_prometheus_metrics.service,prometheus-debian-version-textfile.service,prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:27:18] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [05:27:38] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [05:32:14] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [05:32:49] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [05:34:14] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [05:34:46] !log kartik@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/cxserver: apply [05:35:28] (03CR) 10KartikMistry: [C: 03+2] MinT: Update to 2023-06-01-041041-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/924915 (https://phabricator.wikimedia.org/T336525) (owner: 10KartikMistry) [05:36:19] (03Merged) 10jenkins-bot: MinT: Update to 2023-06-01-041041-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/924915 (https://phabricator.wikimedia.org/T336525) (owner: 10KartikMistry) [05:39:49] !log Updated cxserver to 2023-06-01-041016-production (T337669) [05:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:52] T337669: Enable MinT, Content and Section Translation for a 2nd group of 10 languages previously lacking machine translation - https://phabricator.wikimedia.org/T337669 [05:42:59] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [05:44:50] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [05:46:48] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [05:49:39] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [05:52:57] RECOVERY - Check systemd state on krb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:53:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs2021:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:56:08] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [06:00:06] Deploy window MediaWiki infrastucture (UTC early) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230601T0600) [06:00:06] kormat, marostegui, and Amir1: That opportune time is upon us again. Time for a Primary database switchover deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230601T0600). [06:01:35] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [06:07:29] RECOVERY - Disk space on krb1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=krb1001&var-datasource=eqiad+prometheus/ops [06:16:00] !log Updated MinT to 2023-06-01-041041-production (T336525) [06:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:04] T336525: Review code mappings for MinT - https://phabricator.wikimedia.org/T336525 [06:16:12] ^ missed logging this earlier.. [06:23:31] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review, 10User-MoritzMuehlenhoff: Stop using mod_access_compat - https://phabricator.wikimedia.org/T258686 (10Joe) Can I ask what's the motivation for wanting to remove the old-style access rules besides keeping up with apache, while apache... [06:26:27] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review, 10User-MoritzMuehlenhoff: Stop using mod_access_compat - https://phabricator.wikimedia.org/T258686 (10Joe) Unless someone wants to take on the project of changing every one of our configurations and to verify the changes thoroughly (... [06:27:48] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "This would break the configurations of any apache httpd still using the old 'Allow/Deny' access logic. 
So, first of all, every single medi" [puppet] - 10https://gerrit.wikimedia.org/r/923615 (https://phabricator.wikimedia.org/T258686) (owner: 10Jbond) [06:31:52] (03PS1) 10Ladsgroup: wmcs: Update the revision_comment_temp index to revision table [puppet] - 10https://gerrit.wikimedia.org/r/925547 (https://phabricator.wikimedia.org/T215466) [06:35:37] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "For the record, it's not just mediawiki:" [puppet] - 10https://gerrit.wikimedia.org/r/923615 (https://phabricator.wikimedia.org/T258686) (owner: 10Jbond) [06:47:11] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "See the previous change." [puppet] - 10https://gerrit.wikimedia.org/r/923616 (https://phabricator.wikimedia.org/T258686) (owner: 10Jbond) [06:51:56] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41483/console" [puppet] - 10https://gerrit.wikimedia.org/r/924985 (owner: 10Majavah) [06:54:02] (03CR) 10Majavah: "This and the following patches can now go in as the last stretch VMs are gone." [puppet] - 10https://gerrit.wikimedia.org/r/924981 (owner: 10Majavah) [07:00:06] Amir1, apergos, and jnuche: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230601T0700). [07:00:06] isaranto, duesen, and matthiasmullie: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:15] morning! we actually have someone signed up for training today, so we'll be taking things nice and easy for the sake of explaining everything that's going on. there are three patches set to go, all config changes, so none should take too long. 
I'd prefer that one of the other backport window runners screenshare and type while I talk about what's being done and why, and what goes on under the hood, etc. [07:00:18] looking at Amir1 and jnuche to see if either of you are around and willing to be the person screensharing. With scap backport there's not much to do but it's still useful for the trainee to see the output at each step. [07:00:39] o/ [07:00:53] PROBLEM - Check systemd state on debmonitor2003 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_uwsgi-debmonitor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:01:12] I'm around but I need a bit [07:01:15] hey. I assume you will self deploy? we have two other patches to go, matthiasmullie so it's fine if you want to go first (assuming that the other patch owners are not here in the next 2 minutes) [07:01:32] we are still waiting for the trainee to arrive, so that's fine Amir1 [07:02:09] apergos: I can self deploy, or sit through the regular process if that's helpful for training purposes [07:02:57] ah but can you self deploy and screenshare? :-) really the other two patches will be enough I think for purposes of illustration, if you want to just get it done. [07:03:43] alright sure; I'll move forward with mine real quick then while we're waiting for others to arrive - ok? 
[07:03:49] sounds great [07:03:53] (03PS2) 10Matthias Mullie: Add $wgInterwikiLogoOverride [mediawiki-config] - 10https://gerrit.wikimedia.org/r/917871 (https://phabricator.wikimedia.org/T315269) [07:04:20] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by mlitn@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/917871 (https://phabricator.wikimedia.org/T315269) (owner: 10Matthias Mullie) [07:05:10] (03Merged) 10jenkins-bot: Add $wgInterwikiLogoOverride [mediawiki-config] - 10https://gerrit.wikimedia.org/r/917871 (https://phabricator.wikimedia.org/T315269) (owner: 10Matthias Mullie) [07:06:03] Amir1: o/ [07:06:14] effie: are you around to monitor jobrunners? [07:06:43] Let me see if isaranto is joining, if not, I just deploy your patch [07:06:58] ah there you are! matthiasmullie is going first; as I was mentioning a couple moments ago we have a trainee for this window so we'll be going through this slowly (once they arrive) [07:07:10] thanks Amir1 [07:07:15] FYI there are 2 unexpected commits; they're both by Roan & marked as "beta", so I expect it's safe to proceed? [07:07:19] apergos: is there a video call? [07:07:20] duesen: yep [07:07:23] go [07:07:30] yes there is a call [07:07:33] matthiasmullie: can you send me the link to the patches? [07:07:35] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/912417 & https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/913002 - both InitialiseSettings-labs.php indeed [07:07:42] (03PS1) 10Muehlenhoff: Bitu: Add an alert for the front page [puppet] - 10https://gerrit.wikimedia.org/r/925650 (https://phabricator.wikimedia.org/T320603) [07:07:44] it's linked off the calendar window [07:07:46] effie: cool. i guess there are a few more patches to go in first [07:07:51] hmm maybe that's not on your calendar [07:07:56] o_O [07:07:57] we can queue :p [07:08:01] yeah, rebase. 
I'll go bother Roan to rebase them next time [07:08:04] (03CR) 10CI reject: [V: 04-1] Bitu: Add an alert for the front page [puppet] - 10https://gerrit.wikimedia.org/r/925650 (https://phabricator.wikimedia.org/T320603) (owner: 10Muehlenhoff) [07:08:09] did i forget to hit save? [07:08:11] passed it to you in pms [07:08:52] duesen: your patch will be next if ilias does not arrive by the time the first patch is done [07:08:58] !log mlitn@deploy1002 Started scap: Backport for [[gerrit:917871|Add $wgInterwikiLogoOverride (T315269)]] [07:08:59] ah, nvm, i thought you meant my patch isn't on the deployment calendar :) [07:09:02] T315269: Special:Search - Update interwiki widget sister project icons - https://phabricator.wikimedia.org/T315269 [07:09:34] o/ [07:09:54] (03PS2) 10Muehlenhoff: Bitu: Add an alert for the front page [puppet] - 10https://gerrit.wikimedia.org/r/925650 (https://phabricator.wikimedia.org/T320603) [07:09:59] Is it now or in 50'? Where can I find the meeting link? [07:10:03] welcome isaranto! [07:10:18] you should have a link in your calendar but if not let me pm it to you [07:10:29] (03PS6) 10Daniel Kinzler: Enable parser cache warming jobs for parsoid on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923588 (https://phabricator.wikimedia.org/T329366) [07:15:57] (03CR) 10Gmodena: [C: 03+1] Fix overlapping names edge case in flink-operator (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/925016 (https://phabricator.wikimedia.org/T336185) (owner: 10TChin) [07:16:12] (03CR) 10Alexandros Kosiaris: [C: 03+1] "The reason was that until 1.12 kubelet had cadvisor integrated in it. 
One could pass --cadvisor-port and get a fully functional cadvisor w" [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/924963 (https://phabricator.wikimedia.org/T337836) (owner: 10Elukey) [07:16:18] (03CR) 10Alexandros Kosiaris: debian: remove cadvisor from the kubelet's systemd unit [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/924963 (https://phabricator.wikimedia.org/T337836) (owner: 10Elukey) [07:17:08] effie: can you link me to the metrics to monitor as well? just because I like to look at graphs :) [07:17:30] FYI last scap output was ~7min ago at "07:10:45 K8s images build/push output redirected to /home/mlitn/scap-image-build-and-push-log" - can't remember this taking long in the past? [07:18:01] duesen: lol sure 1' [07:18:13] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/925650 (https://phabricator.wikimedia.org/T320603) (owner: 10Muehlenhoff) [07:18:15] I don't know about lengths of time, looking at effie for that [07:18:47] duesen: https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=parsoidCachePrewarm&from=now-12h&to=now https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?from=now-6h&orgId=1&to=now [07:20:13] apergos: you mean literally or figuratively ? [07:20:18] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: Setup in progress [07:20:22] (still no progress; scap has been silent now for ~10min) [07:20:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: Setup in progress [07:20:38] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10dcausse) [07:21:00] matthiasmullie: can you please tell what you scapped? 
[07:21:07] I will see if I can help [07:21:09] nvm, just started moving again [07:21:25] see? It just needed some scare [07:21:29] build-and-push-container-images did in fact take 10min; didn't remember that taking so long in the past :p [07:21:33] haha [07:21:46] ok, I'll try to remember that going forward [07:22:19] (03PS1) 10Tim Starling: Fix some mwscript bugs and clean up the style [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925654 [07:25:56] (03PS3) 10Muehlenhoff: Bitu: Add an alert for the front page [puppet] - 10https://gerrit.wikimedia.org/r/925650 (https://phabricator.wikimedia.org/T320603) [07:29:39] !log mlitn@deploy1002 mlitn: Backport for [[gerrit:917871|Add $wgInterwikiLogoOverride (T315269)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [07:29:42] T315269: Special:Search - Update interwiki widget sister project icons - https://phabricator.wikimedia.org/T315269 [07:31:00] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/925650 (https://phabricator.wikimedia.org/T320603) (owner: 10Muehlenhoff) [07:32:19] effie: thanks! any idea what caused the bump in rows read about two hours ago? It's still not back to normal.. 
[07:32:59] hm i guess it is kind of normal [07:33:31] duesen: we were looking at the graphs yesterday with Amir1, it is just "jobs" [07:33:56] and especially this side of the world is waking up, people get busy [07:35:00] !log installing libssh security updates [07:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:37] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:36:17] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:37:49] duesen: on phone. The rows read can also be from maint work. Don't worry [07:38:39] ok then :) [07:39:25] if turning on the cache warming jobs for frwiki goes well, can we do enwiki+dewiki later today? [07:40:18] Why not all (besides the two monsters)? 
[07:40:32] So we can call it done [07:42:01] !log mlitn@deploy1002 Finished scap: Backport for [[gerrit:917871|Add $wgInterwikiLogoOverride (T315269)]] (duration: 33m 02s) [07:42:04] T315269: Special:Search - Update interwiki widget sister project icons - https://phabricator.wikimedia.org/T315269 [07:42:11] apergos: I'm done; sorry for taking up so much time; scap felt quite slow today, guess it's in need of some coffee too [07:42:27] no worries, we'll just zip through these others hopefully [07:44:04] (03CR) 10Daniel Kinzler: [C: 03+2] Enable parser cache warming jobs for parsoid on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923588 (https://phabricator.wikimedia.org/T329366) (owner: 10Daniel Kinzler) [07:44:50] (03Merged) 10jenkins-bot: Enable parser cache warming jobs for parsoid on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923588 (https://phabricator.wikimedia.org/T329366) (owner: 10Daniel Kinzler) [07:45:47] It was probably rebuilding images because of i18n changes. 
It happens from time to time [07:46:04] The next deploys will be faster [07:46:43] !log daniel@deploy1002 Started scap: Backport for [[gerrit:923588|Enable parser cache warming jobs for parsoid on frwiki (T329366)]] [07:46:46] T329366: Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 [07:46:48] (03PS1) 10Muehlenhoff: swift: Remove stretch support [puppet] - 10https://gerrit.wikimedia.org/r/925656 [07:47:10] (03CR) 10CI reject: [V: 04-1] swift: Remove stretch support [puppet] - 10https://gerrit.wikimedia.org/r/925656 (owner: 10Muehlenhoff) [07:48:19] !log daniel@deploy1002 daniel: Backport for [[gerrit:923588|Enable parser cache warming jobs for parsoid on frwiki (T329366)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [07:50:22] (03PS20) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 [07:51:23] (03PS2) 10Muehlenhoff: swift: Remove stretch support [puppet] - 10https://gerrit.wikimedia.org/r/925656 [07:51:30] (03PS21) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 [07:53:54] (03CR) 10CI reject: [V: 04-1] WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede) [07:55:53] !log daniel@deploy1002 Finished scap: Backport for [[gerrit:923588|Enable parser cache warming jobs for parsoid on frwiki (T329366)]] (duration: 09m 09s) [07:55:56] T329366: Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 [07:56:32] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/925656 (owner: 10Muehlenhoff) [07:58:32] (03PS22) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 [07:58:58] effie, Amir1: deployed the patch, got a good bump in job insertion rate: 
https://grafana-rw.wikimedia.org/d/OxxOv5K4k/ve-backend-dashboard?orgId=1&from=now-1h&to=now [08:00:20] went from 40 jobs/sec to 70 jobs/sec, just from adding frwiki. [08:00:37] Adding all large wikis will make it jump to, oh... 1000 or so? [08:00:58] (03CR) 10CI reject: [V: 04-1] WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede) [08:02:47] It's stacking I think. So 35 per DC. Not too bad [08:03:32] You probably should give the job its own lane (concurrency) in CP config [08:03:42] If not done already [08:04:33] we're stealing from the citoid window at this point, but I don't see mvolz in here so I think it's ok [08:04:38] (03PS23) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 [08:06:21] Amir1: _joe_ was talking about that. I know nothing about changeprop config... [08:06:58] <_joe_> 1k jobs/s it's a lot heh [08:07:05] (03CR) 10CI reject: [V: 04-1] WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede) [08:07:18] If you have helm charts repo locally, search for refreshlinks there [08:07:40] Actually I'm wrong. The configuration is somewhere else [08:07:47] Search in codesearch [08:08:04] _joe_: that estimate is probably a bit high, after looking at some numbers I would now guess 500. But it's still a lot. [08:08:06] (Sorry still on phone) [08:08:23] <_joe_> duesen: sorry let me check one thing [08:08:28] (I was wrong, we're not stealing from the next window yet, that's 2 hours away. but we are going to go over a bit.) [08:08:56] _joe_: this is not related to the deployment, correct? we can proceed with the next patch? 
[08:09:22] (03PS24) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 [08:09:41] PROBLEM - SSH on wdqs2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:09:44] <_joe_> apergos: yeah sorry [08:09:48] <_joe_> let us move elsewhere [08:09:51] great. [08:11:58] (03CR) 10CI reject: [V: 04-1] WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede) [08:12:32] (03CR) 10Jcrespo: [C: 03+2] "This keeps failing, generating alert spam. Merging, and later cloud can decide to revert if fixed." [puppet] - 10https://gerrit.wikimedia.org/r/924902 (owner: 10Jcrespo) [08:13:49] (03PS25) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 [08:16:05] (03Abandoned) 10Gehel: Cleanup unused hiera variable. [puppet] - 10https://gerrit.wikimedia.org/r/924946 (owner: 10Gehel) [08:16:14] (03CR) 10CI reject: [V: 04-1] WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede) [08:16:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by daniel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922512 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [08:17:47] (03Merged) 10jenkins-bot: ORES: add model versions configuration and thresholds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922512 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [08:17:50] (03CR) 10JMeybohm: [C: 03+1] miscweb: add bienvenida release to miscweb staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/924898 (https://phabricator.wikimedia.org/T337047) (owner: 10Jelto) [08:18:12] !log daniel@deploy1002 Started scap: Backport for [[gerrit:922512|ORES: add model versions configuration and thresholds (T319170)]] [08:18:16] T319170: Move backend of ORES 
MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 [08:18:26] wooooww [08:18:31] isaranto: --^ \o/ [08:19:54] !log daniel@deploy1002 daniel and isaranto: Backport for [[gerrit:922512|ORES: add model versions configuration and thresholds (T319170)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [08:20:22] (03CR) 10Jelto: [C: 03+2] miscweb: add bienvenida release to miscweb staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/924898 (https://phabricator.wikimedia.org/T337047) (owner: 10Jelto) [08:21:18] (03PS1) 10Slyngshede: Netbox dummy OIDC secret [labs/private] - 10https://gerrit.wikimedia.org/r/925666 [08:21:58] (03PS2) 10Tim Starling: Fix some mwscript bugs and clean up the style [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925654 [08:22:55] (03Merged) 10jenkins-bot: miscweb: add bienvenida release to miscweb staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/924898 (https://phabricator.wikimedia.org/T337047) (owner: 10Jelto) [08:25:03] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Netbox dummy OIDC secret [labs/private] - 10https://gerrit.wikimedia.org/r/925666 (owner: 10Slyngshede) [08:27:00] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10kostajh) [08:28:14] !log jelto@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
[08:28:21] (03CR) 10Kosta Harlan: ipoid: Create iPoid chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/921700 (https://phabricator.wikimedia.org/T336163) (owner: 10Effie Mouzeli) [08:28:25] !log daniel@deploy1002 Finished scap: Backport for [[gerrit:922512|ORES: add model versions configuration and thresholds (T319170)]] (duration: 10m 12s) [08:28:27] T319170: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 [08:28:36] !log jelto@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [08:29:59] !log jelto@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [08:30:14] (03CR) 10Filippo Giunchedi: Bitu: Add an alert for the front page (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/925650 (https://phabricator.wikimedia.org/T320603) (owner: 10Muehlenhoff) [08:30:35] !log jelto@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [08:32:43] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:33:53] (03PS3) 10Tim Starling: Fix some mwscript bugs and clean up the style [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925654 [08:36:02] (03PS1) 10Fabfur: admin: Add fabfur to ops group [puppet] - 10https://gerrit.wikimedia.org/r/925670 [08:37:38] (03Abandoned) 10Jbond: httpd: set legacy_compat to absent [puppet] - 10https://gerrit.wikimedia.org/r/923615 (https://phabricator.wikimedia.org/T258686) (owner: 10Jbond) [08:37:46] (03Abandoned) 10Jbond: httpd: remove legacy_compat option [puppet] - 10https://gerrit.wikimedia.org/r/923616 (https://phabricator.wikimedia.org/T258686) (owner: 10Jbond) [08:38:47] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:39:20] 
(03CR) 10Zabe: [C: 03+1] wmcs: Update the revision_comment_temp index to revision table [puppet] - 10https://gerrit.wikimedia.org/r/925547 (https://phabricator.wikimedia.org/T215466) (owner: 10Ladsgroup) [08:39:23] (03PS1) 10JMeybohm: prometheus::k8s: Unconditionally use client certs [puppet] - 10https://gerrit.wikimedia.org/r/925674 (https://phabricator.wikimedia.org/T325268) [08:40:36] (03PS1) 10Jbond: admin: update ssh key for Kimberly Sarabia [puppet] - 10https://gerrit.wikimedia.org/r/925677 [08:40:43] (03PS1) 10JMeybohm: prometheus::k8s: Remove cluster tokens [labs/private] - 10https://gerrit.wikimedia.org/r/925678 (https://phabricator.wikimedia.org/T325268) [08:41:04] (03CR) 10Jbond: [C: 03+2] admin: update ssh key for Kimberly Sarabia [puppet] - 10https://gerrit.wikimedia.org/r/925677 (owner: 10Jbond) [08:43:08] (03CR) 10Tim Starling: "I cherry-picked it to deployment-prep and tested it there:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925654 (owner: 10Tim Starling) [08:44:15] (03PS4) 10Muehlenhoff: Bitu: Add an alert for the front page [puppet] - 10https://gerrit.wikimedia.org/r/925650 (https://phabricator.wikimedia.org/T320603) [08:44:36] (03PS2) 10Ladsgroup: wmcs: Update the revision_comment_temp index to revision table [puppet] - 10https://gerrit.wikimedia.org/r/925547 (https://phabricator.wikimedia.org/T215466) [08:44:38] (03CR) 10Muehlenhoff: Bitu: Add an alert for the front page (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/925650 (https://phabricator.wikimedia.org/T320603) (owner: 10Muehlenhoff) [08:44:43] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] wmcs: Update the revision_comment_temp index to revision table [puppet] - 10https://gerrit.wikimedia.org/r/925547 (https://phabricator.wikimedia.org/T215466) (owner: 10Ladsgroup) [08:45:09] (03CR) 10Volans: "reply to question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [08:45:20] (03CR) 10JMeybohm: [V: 
03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41486/console" [puppet] - 10https://gerrit.wikimedia.org/r/925674 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [08:45:49] (03CR) 10Jbond: "LGTM but will need approval from one of the approvers" [puppet] - 10https://gerrit.wikimedia.org/r/925670 (owner: 10Fabfur) [08:45:58] (03PS2) 10Ladsgroup: maintain-views: Drop views on revision_comment_temp [puppet] - 10https://gerrit.wikimedia.org/r/923545 (https://phabricator.wikimedia.org/T275246) (owner: 10Zabe) [08:46:02] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] maintain-views: Drop views on revision_comment_temp [puppet] - 10https://gerrit.wikimedia.org/r/923545 (https://phabricator.wikimedia.org/T275246) (owner: 10Zabe) [08:46:27] 10SRE, 10SRE-Access-Requests: Requesting access to ops group for fabfur - https://phabricator.wikimedia.org/T337911 (10Fabfur) [08:47:32] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] "This is blocking important work and keeping it like this is not fair to volunteer devs that are waiting for it." 
[puppet] - 10https://gerrit.wikimedia.org/r/923545 (https://phabricator.wikimedia.org/T275246) (owner: 10Zabe) [08:47:40] given that the last patch went around over 15 minutes ago and everything still looks good, going to call the backport window done [08:47:41] (03CR) 10Tim Starling: "I did a codesearch for callers of MWMultiVersion::getMediaWikiCli(), and the only relevant thing I found was the script discussed at https" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925654 (owner: 10Tim Starling) [08:47:45] (03PS2) 10Fabfur: admin: Add fabfur to ops group [puppet] - 10https://gerrit.wikimedia.org/r/925670 (https://phabricator.wikimedia.org/T337911) [08:48:02] !log UTC morning backport and config training window done [08:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:25] PROBLEM - Check systemd state on wdqs2021 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:49:19] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops group for fabfur - https://phabricator.wikimedia.org/T337911 (10Vgutierrez) We need @KOfori approval for this one :) [08:49:20] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [08:50:19] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops group for fabfur - https://phabricator.wikimedia.org/T337911 (10Fabfur) [08:50:42] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:51:17] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol2004-dev - aborrero@cumin1001" [08:52:13] !log aborrero@cumin1001 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol2004-dev - aborrero@cumin1001" [08:52:13] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:55:42] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:56:37] (03PS3) 10Jbond: sre.cdn: move common functions to base class [cookbooks] - 10https://gerrit.wikimedia.org/r/923662 [08:56:39] (03PS1) 10Jbond: sre.cdn.roll-upgrade-haproxy: migrate to SRELBBatchRunnerBaseCDN [cookbooks] - 10https://gerrit.wikimedia.org/r/925681 [08:56:41] !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye [08:56:48] 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2004-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T337828 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet wi... 
[08:57:40] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kafka-test1006.eqiad.wmnet [08:57:41] PROBLEM - Check systemd state on wdqs2021 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:57:51] (03PS1) 10Jelto: miscweb: add bienvenida release to eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/925682 (https://phabricator.wikimedia.org/T337047) [08:58:43] (03CR) 10Jbond: sre.cdn: move common functions to base class (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/923662 (owner: 10Jbond) [08:58:48] (03PS11) 10Jbond: Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [08:58:50] (03PS4) 10Jbond: sre.cdn: move common functions to base class [cookbooks] - 10https://gerrit.wikimedia.org/r/923662 [08:58:52] (03PS2) 10Jbond: sre.cdn.roll-upgrade-haproxy: migrate to SRELBBatchRunnerBaseCDN [cookbooks] - 10https://gerrit.wikimedia.org/r/925681 [08:58:54] (03CR) 10CI reject: [V: 04-1] sre.cdn: move common functions to base class [cookbooks] - 10https://gerrit.wikimedia.org/r/923662 (owner: 10Jbond) [08:58:56] (03CR) 10CI reject: [V: 04-1] sre.cdn.roll-upgrade-haproxy: migrate to SRELBBatchRunnerBaseCDN [cookbooks] - 10https://gerrit.wikimedia.org/r/925681 (owner: 10Jbond) [09:00:05] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops group for fabfur - https://phabricator.wikimedia.org/T337911 (10Fabfur) [09:00:11] (03PS1) 10Arturo Borrero Gonzalez: cloudcontrol2004-dev: put it into service with new IP address [puppet] - 10https://gerrit.wikimedia.org/r/925683 (https://phabricator.wikimedia.org/T337828) [09:01:02] (03CR) 10CI reject: [V: 04-1] Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 
(https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [09:01:08] (03CR) 10CI reject: [V: 04-1] sre.cdn.roll-upgrade-haproxy: migrate to SRELBBatchRunnerBaseCDN [cookbooks] - 10https://gerrit.wikimedia.org/r/925681 (owner: 10Jbond) [09:01:15] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 122 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [09:01:23] (03CR) 10CI reject: [V: 04-1] sre.cdn: move common functions to base class [cookbooks] - 10https://gerrit.wikimedia.org/r/923662 (owner: 10Jbond) [09:01:27] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops group for fabfur - https://phabricator.wikimedia.org/T337911 (10Vgutierrez) [09:01:28] (03PS1) 10Muehlenhoff: Extend KDC logrotate config to also include krb5kdc.log [puppet] - 10https://gerrit.wikimedia.org/r/925684 (https://phabricator.wikimedia.org/T337906) [09:01:31] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kafka-test1006.eqiad.wmnet [09:01:40] (03PS2) 10Muehlenhoff: Extend KDC logrotate config to also include krb5kdc.log [puppet] - 10https://gerrit.wikimedia.org/r/925684 (https://phabricator.wikimedia.org/T337906) [09:01:52] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/925684 (https://phabricator.wikimedia.org/T337906) (owner: 10Muehlenhoff) [09:01:54] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops group for fabfur - https://phabricator.wikimedia.org/T337911 (10Vgutierrez) p:05Triage→03Medium [09:02:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudcontrol2004-dev: put it into service with new IP address [puppet] - 10https://gerrit.wikimedia.org/r/925683 (https://phabricator.wikimedia.org/T337828) (owner: 10Arturo Borrero Gonzalez) [09:02:25] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kafka-test1007.eqiad.wmnet [09:03:17] RECOVERY - SSH on wdqs2021 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:03:49] RECOVERY - Check systemd state on wdqs2021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:04:18] (03PS1) 10Elukey: ml-services: add scale overrides for reverrisk in ml staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/925686 [09:05:07] (03CR) 10CI reject: [V: 04-1] ml-services: add scale overrides for reverrisk in ml staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/925686 (owner: 10Elukey) [09:06:20] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kafka-test1007.eqiad.wmnet [09:07:05] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kafka-test1008.eqiad.wmnet [09:11:00] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kafka-test1008.eqiad.wmnet [09:12:05] !log installed spicerack v7.2.0 on cumin1001 [09:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:18] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus::k8s: Unconditionally use client certs [puppet] - 10https://gerrit.wikimedia.org/r/925674 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [09:13:57] PROBLEM - SSH on wdqs2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:13:58] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kafka-test1009.eqiad.wmnet [09:14:19] (03CR) 10Filippo Giunchedi: [C: 03+1] Bitu: Add an alert for the front page [puppet] - 10https://gerrit.wikimedia.org/r/925650 (https://phabricator.wikimedia.org/T320603) (owner: 10Muehlenhoff) [09:14:29] (03CR) 10Volans: [C: 03+2] sre.ganeti.makvm: update the default memory to 1.5 [cookbooks] - 10https://gerrit.wikimedia.org/r/924466 (owner: 10Jbond) [09:15:28] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41488/console" [puppet] - 10https://gerrit.wikimedia.org/r/924970 (https://phabricator.wikimedia.org/T334435) (owner: 10EoghanGaffney) [09:16:23] (03Abandoned) 10Elukey: ml-services: add scale overrides for reverrisk in ml staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/925686 (owner: 10Elukey) [09:16:28] !log volans@cumin1001 START - Cookbook sre.hosts.dhcp for host sretest1001.eqiad.wmnet [09:16:48] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host sretest1001.eqiad.wmnet [09:16:52] (03Merged) 10jenkins-bot: sre.ganeti.makvm: update the default memory to 1.5 [cookbooks] - 10https://gerrit.wikimedia.org/r/924466 (owner: 10Jbond) [09:17:49] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kafka-test1009.eqiad.wmnet [09:17:56] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kafka-test1010.eqiad.wmnet [09:18:29] !log remove lv prometheus-global - T288196 [09:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:31] T288196: Retire Prometheus 'global' instance - https://phabricator.wikimedia.org/T288196 [09:21:06] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] prometheus::k8s: Unconditionally use client certs [puppet] - 10https://gerrit.wikimedia.org/r/925674 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [09:21:11] RECOVERY - SSH on wdqs2021 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:21:48] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kafka-test1010.eqiad.wmnet [09:22:32] 10SRE-swift-storage, 10serviceops-collab: Investigate object storage for Gitlab - https://phabricator.wikimedia.org/T336234 (10fgiunchedi) >>! In T336234#8870306, @fgiunchedi wrote: >>>! 
In T336234#8864792, @MatthewVernon wrote: >> I think here we are talking about using the S3 protocol? That is currently only... [09:25:09] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] prometheus::k8s: Remove cluster tokens [labs/private] - 10https://gerrit.wikimedia.org/r/925678 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [09:30:42] !log aborrero@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye [09:30:54] 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2004-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T337828 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet with O... [09:32:44] !log installed spicerack v7.2.0 on cumin2002 [09:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:53] (03PS26) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 [09:34:04] (03PS1) 10JMeybohm: kubernetes: Remove infrastructure_users static token file [puppet] - 10https://gerrit.wikimedia.org/r/925707 (https://phabricator.wikimedia.org/T325268) [09:35:01] !log cmooney@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol2004-dev'] [09:35:11] (03PS1) 10JMeybohm: Remove profile::kubernetes::infrastructure_users [labs/private] - 10https://gerrit.wikimedia.org/r/925708 (https://phabricator.wikimedia.org/T325268) [09:35:23] (03CR) 10CI reject: [V: 04-1] WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede) [09:35:33] (03PS27) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 [09:36:08] !log cmooney@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcontrol2004-dev'] 
[09:36:09] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41491/console" [puppet] - 10https://gerrit.wikimedia.org/r/925707 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [09:36:27] !log cmooney@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol2004-dev'] [09:37:23] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41492/console" [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede) [09:38:01] (03CR) 10CI reject: [V: 04-1] WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede) [09:40:02] (03CR) 10Jelto: [V: 03+1 C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/924970 (https://phabricator.wikimedia.org/T334435) (owner: 10EoghanGaffney) [09:40:50] (03CR) 10Ladsgroup: "I need a bit of time to properly review this, in the mean time, Daniel might have some ideas." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/925654 (owner: 10Tim Starling) [09:42:46] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10Platform Team Workboards (Platform Engineering Reliability): Final steps for fully-Kubernetes Thumbor - https://phabricator.wikimedia.org/T334488 (10hnowlan) [09:42:50] (03PS3) 10Jbond: sre: update base class with an upgrade action [cookbooks] - 10https://gerrit.wikimedia.org/r/923670 [09:42:52] (03PS1) 10Jbond: sre: base class apply black [cookbooks] - 10https://gerrit.wikimedia.org/r/925711 [09:43:06] !log cmooney@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcontrol2004-dev'] [09:43:24] (03PS2) 10JMeybohm: kubernetes: Remove infrastructure_users static token file [puppet] - 10https://gerrit.wikimedia.org/r/925707 (https://phabricator.wikimedia.org/T325268) [09:45:08] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41493/console" [puppet] - 10https://gerrit.wikimedia.org/r/925707 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [09:45:22] (03CR) 10CI reject: [V: 04-1] sre: update base class with an upgrade action [cookbooks] - 10https://gerrit.wikimedia.org/r/923670 (owner: 10Jbond) [09:49:37] !log cmooney@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye [09:49:45] 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2004-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T337828 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet wit... 
[09:50:38] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10Platform Team Workboards (Platform Engineering Reliability): Future of Thumbor's memcached backend - https://phabricator.wikimedia.org/T318695 (10jijiki) [09:52:04] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10Platform Team Workboards (Platform Engineering Reliability): Future of Thumbor's memcached backend - https://phabricator.wikimedia.org/T318695 (10jijiki) [09:52:31] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/925711 (owner: 10Jbond) [09:52:38] !log cleaning apt archives on an-test-worker1002: `sudo apt-get clean`, recovering 14G [09:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:48] stevemunene: ^ [09:53:39] !log ladsgroup@mwmaint1002:~$ foreachwikiindblist group2 extensions/AbuseFilter/maintenance/MigrateActorsAF.php (T336224) [09:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:41] T336224: Run MigrateActorsAF on all wikis - https://phabricator.wikimedia.org/T336224 [09:53:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs2021:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:56:04] !log installing systemd security updates on bullseye [09:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:17] 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2004-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T337828 (10cmooney) >>! In T337828#8892772, @Jhancock.wm wrote: > @aborrero server has been re-racked in B1 - U21 and connected to the cloudsw-b1... 
[09:56:22] (03PS28) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 [09:56:33] 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2004-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T337828 (10cmooney) [09:56:36] dcausse, ryankemper, inflatador: ^^^ is the consumer lag expected on wdqs2021? It seems to be recovering. [09:56:56] (03CR) 10Elukey: [C: 03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/925707 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [09:57:41] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41494/console" [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede) [09:58:29] gehel: this machine has been on and off for several days I think, not sure what's going on there [09:58:47] (03PS4) 10Jbond: sre: update base class with an upgrade action [cookbooks] - 10https://gerrit.wikimedia.org/r/923670 [09:58:58] (03CR) 10CI reject: [V: 04-1] WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede) [09:59:25] dcausse: I'll open a phab task, and hopefully, inflatador or ryankemper can have a look [09:59:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2114.codfw.wmnet with reason: Maintenance [10:00:04] mvolz: OwO what's this, a deployment window?? Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230601T1000). 
nyaa~ [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230601T1000) [10:00:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2114.codfw.wmnet with reason: Maintenance [10:00:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2114 (T336886)', diff saved to https://phabricator.wikimedia.org/P48676 and previous config saved to /var/cache/conftool/dbconfig/20230601-100011-ladsgroup.json [10:00:28] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [10:01:16] (03CR) 10CI reject: [V: 04-1] sre: update base class with an upgrade action [cookbooks] - 10https://gerrit.wikimedia.org/r/923670 (owner: 10Jbond) [10:01:46] wdqs2021 is not pooled, so no emergency there. That's probably part of the test of Bullseye migration: T331300 [10:01:46] T331300: Ensure WCQS/WDQS stack works on Bullseye - https://phabricator.wikimedia.org/T331300 [10:01:48] (03CR) 10Elukey: "Precautionary question just to be sure - on dse I see users like:" [puppet] - 10https://gerrit.wikimedia.org/r/925707 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [10:01:53] (03PS5) 10Jbond: sre: update base class with an upgrade action [cookbooks] - 10https://gerrit.wikimedia.org/r/923670 [10:02:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T336886)', diff saved to https://phabricator.wikimedia.org/P48677 and previous config saved to /var/cache/conftool/dbconfig/20230601-100224-ladsgroup.json [10:03:54] (03CR) 10Elukey: [C: 03+1] "I see it should be covered by https://gerrit.wikimedia.org/r/c/operations/puppet/+/904500, so I assume that all users are already migrated" [puppet] - 10https://gerrit.wikimedia.org/r/925707 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [10:04:25] (03PS29) 10Slyngshede: WIP P:netbox reconfigure to used OIDC 
[puppet] - 10https://gerrit.wikimedia.org/r/922506 [10:04:32] (03CR) 10CI reject: [V: 04-1] sre: update base class with an upgrade action [cookbooks] - 10https://gerrit.wikimedia.org/r/923670 (owner: 10Jbond) [10:06:50] (03CR) 10CI reject: [V: 04-1] WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede) [10:13:25] (03PS6) 10Jbond: sre: update base class with an upgrade action [cookbooks] - 10https://gerrit.wikimedia.org/r/923670 [10:13:48] (03CR) 10Jbond: sre: update base class with an upgrade action (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/923670 (owner: 10Jbond) [10:14:10] !log aborrero@cumin2002 START - Cookbook sre.dns.netbox [10:14:40] (03PS7) 10Jbond: sre: update base class with an upgrade action [cookbooks] - 10https://gerrit.wikimedia.org/r/923670 [10:16:09] !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol2004-dev.private.codfw.wikimedia.cloud - aborrero@cumin2002" [10:16:53] (03CR) 10CI reject: [V: 04-1] sre: update base class with an upgrade action [cookbooks] - 10https://gerrit.wikimedia.org/r/923670 (owner: 10Jbond) [10:17:15] !log aborrero@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol2004-dev.private.codfw.wikimedia.cloud - aborrero@cumin2002" [10:17:15] !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:17:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P48678 and previous config saved to /var/cache/conftool/dbconfig/20230601-101730-ladsgroup.json [10:23:03] (03PS8) 10Jbond: sre: update base class with an upgrade action [cookbooks] - 10https://gerrit.wikimedia.org/r/923670 [10:27:55] (03CR) 10Jbond: "see comment" [puppet] - 
10https://gerrit.wikimedia.org/r/925684 (https://phabricator.wikimedia.org/T337906) (owner: 10Muehlenhoff) [10:28:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:28:25] (03CR) 10Jbond: [C: 03+2] sre: base class apply black [cookbooks] - 10https://gerrit.wikimedia.org/r/925711 (owner: 10Jbond) [10:28:51] (03CR) 10Volans: [C: 04-1] "I think this needs a bit more discussion/agreement" [cookbooks] - 10https://gerrit.wikimedia.org/r/923670 (owner: 10Jbond) [10:29:28] (03CR) 10Muehlenhoff: Extend KDC logrotate config to also include krb5kdc.log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/925684 (https://phabricator.wikimedia.org/T337906) (owner: 10Muehlenhoff) [10:31:13] (03Merged) 10jenkins-bot: sre: base class apply black [cookbooks] - 10https://gerrit.wikimedia.org/r/925711 (owner: 10Jbond) [10:32:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P48679 and previous config saved to /var/cache/conftool/dbconfig/20230601-103236-ladsgroup.json [10:33:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:33:26] (03CR) 10ArielGlenn: "This can go; it was part of the long dead production of media bundles for download. 
I didn't even know it was still in here, thanks for to" [puppet] - 10https://gerrit.wikimedia.org/r/925295 (owner: 10Tim Starling) [10:35:42] (03PS3) 10Muehlenhoff: Extend KDC logrotate config to also include krb5kdc.log [puppet] - 10https://gerrit.wikimedia.org/r/925684 (https://phabricator.wikimedia.org/T337906) [10:35:54] (03PS4) 10Muehlenhoff: Extend KDC logrotate config to also include krb5kdc.log [puppet] - 10https://gerrit.wikimedia.org/r/925684 (https://phabricator.wikimedia.org/T337906) [10:37:32] (03PS9) 10Jbond: sre: update base class with an upgrade action [cookbooks] - 10https://gerrit.wikimedia.org/r/923670 [10:39:51] (03CR) 10CI reject: [V: 04-1] sre: update base class with an upgrade action [cookbooks] - 10https://gerrit.wikimedia.org/r/923670 (owner: 10Jbond) [10:40:32] (03CR) 10Filippo Giunchedi: [C: 03+1] debian: remove cadvisor from the kubelet's systemd unit (031 comment) [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/924963 (https://phabricator.wikimedia.org/T337836) (owner: 10Elukey) [10:45:54] !log cmooney@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye [10:46:02] 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2004-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T337828 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet with OS... [10:46:59] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops group for fabfur - https://phabricator.wikimedia.org/T337911 (10KOfori) Approved. 
[10:47:35] (03CR) 10Jbond: sre: update base class with an upgrade action (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/923670 (owner: 10Jbond) [10:47:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T336886)', diff saved to https://phabricator.wikimedia.org/P48680 and previous config saved to /var/cache/conftool/dbconfig/20230601-104742-ladsgroup.json [10:47:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2117.codfw.wmnet with reason: Maintenance [10:47:46] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [10:47:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2117.codfw.wmnet with reason: Maintenance [10:48:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T336886)', diff saved to https://phabricator.wikimedia.org/P48681 and previous config saved to /var/cache/conftool/dbconfig/20230601-104803-ladsgroup.json [10:48:48] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/925684 (https://phabricator.wikimedia.org/T337906) (owner: 10Muehlenhoff) [10:51:56] (03CR) 10Muehlenhoff: [C: 03+2] Extend KDC logrotate config to also include krb5kdc.log [puppet] - 10https://gerrit.wikimedia.org/r/925684 (https://phabricator.wikimedia.org/T337906) (owner: 10Muehlenhoff) [10:53:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T336886)', diff saved to https://phabricator.wikimedia.org/P48682 and previous config saved to /var/cache/conftool/dbconfig/20230601-105303-ladsgroup.json [10:53:07] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [11:00:19] (03PS1) 10Muehlenhoff: Cloud: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/925721 [11:03:10] (03PS6) 10Effie Mouzeli: ipoid: Create iPoid chart 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/921700 (https://phabricator.wikimedia.org/T336163) [11:03:49] (03CR) 10CI reject: [V: 04-1] ipoid: Create iPoid chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/921700 (https://phabricator.wikimedia.org/T336163) (owner: 10Effie Mouzeli) [11:04:24] (03PS2) 10Jelto: microsites: remove bienvenida.wikimedia.org, migrated to k8s [puppet] - 10https://gerrit.wikimedia.org/r/923655 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [11:04:44] !log disabling puppet on all kubernetes control planes for https://gerrit.wikimedia.org/r/c/operations/puppet/+/925707 [11:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:29] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] kubernetes: Remove infrastructure_users static token file [puppet] - 10https://gerrit.wikimedia.org/r/925707 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [11:08:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P48683 and previous config saved to /var/cache/conftool/dbconfig/20230601-110810-ladsgroup.json [11:11:19] (03PS30) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 [11:11:47] (03PS1) 10JMeybohm: kubernetes: Remove infrastructure_users static token file [puppet] - 10https://gerrit.wikimedia.org/r/925726 (https://phabricator.wikimedia.org/T325268) [11:12:57] (03PS1) 10Jelto: trafficserver: switch bienvenida.wikimedia.org backend [puppet] - 10https://gerrit.wikimedia.org/r/925727 (https://phabricator.wikimedia.org/T300171) [11:13:26] (03CR) 10Majavah: [C: 03+1] "some parts of this patch are duplicates of patches I sent yesterday, starting from https://gerrit.wikimedia.org/r/c/operations/puppet/+/92" [puppet] - 10https://gerrit.wikimedia.org/r/925721 (owner: 10Muehlenhoff) [11:13:51] (03CR) 10CI reject: [V: 04-1] WIP P:netbox reconfigure to used 
OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede) [11:14:50] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41496/console" [puppet] - 10https://gerrit.wikimedia.org/r/925726 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [11:15:13] (03PS2) 10JMeybohm: kubernetes: Remove infrastructure_users static token file [puppet] - 10https://gerrit.wikimedia.org/r/925726 (https://phabricator.wikimedia.org/T325268) [11:15:46] (03CR) 10Jbond: Add a cookbook to drain a Ganeti node (039 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [11:16:50] (03PS2) 10Daimona Eaytoy: Set $wgCampaignEventsUseNewTrackingToolsSchema to true in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923305 (https://phabricator.wikimedia.org/T336364) [11:17:30] (03PS1) 10Slyngshede: Netbox, fix missing : in variable name [labs/private] - 10https://gerrit.wikimedia.org/r/925729 [11:17:47] (03CR) 10JMeybohm: [C: 03+2] kubernetes: Remove infrastructure_users static token file [puppet] - 10https://gerrit.wikimedia.org/r/925726 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [11:18:07] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Netbox, fix missing : in variable name [labs/private] - 10https://gerrit.wikimedia.org/r/925729 (owner: 10Slyngshede) [11:19:19] (03PS10) 10Jbond: sre: update base class with an upgrade action [cookbooks] - 10https://gerrit.wikimedia.org/r/923670 [11:19:42] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10Platform Team Workboards (Platform Engineering Reliability): Future of Thumbor's memcached backend - https://phabricator.wikimedia.org/T318695 (10hnowlan) [11:19:56] (03PS31) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 [11:20:06] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10Platform Team 
Workboards (Platform Engineering Reliability): Future of Thumbor's memcached backend - https://phabricator.wikimedia.org/T318695 (10hnowlan) fwiw I see no reason not to move to mcrouter [11:20:23] (03CR) 10Jbond: [C: 03+2] proffile::firewall: create new firewall profile [puppet] - 10https://gerrit.wikimedia.org/r/922815 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [11:22:46] (03CR) 10CI reject: [V: 04-1] sre: update base class with an upgrade action [cookbooks] - 10https://gerrit.wikimedia.org/r/923670 (owner: 10Jbond) [11:23:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P48684 and previous config saved to /var/cache/conftool/dbconfig/20230601-112316-ladsgroup.json [11:24:23] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41497/console" [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede) [11:25:45] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:26:14] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[11:26:18] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:28:15] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [11:28:50] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [11:28:52] (03PS1) 10KartikMistry: Use direct Parsoid in Small and Medium Wikis for Content Translation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925742 (https://phabricator.wikimedia.org/T337922) [11:29:01] (03PS32) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (https://phabricator.wikimedia.org/T308002) [11:30:23] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41498/console" [puppet] - 10https://gerrit.wikimedia.org/r/922506 (https://phabricator.wikimedia.org/T308002) (owner: 10Slyngshede) [11:30:27] (03PS14) 10Jbond: profile::base::firewall: move to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) [11:30:29] (03PS4) 10Jbond: base::firewall: remove the old firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/922816 (https://phabricator.wikimedia.org/T279683) [11:30:31] (03PS14) 10Jbond: firewall: add basic firewall class [puppet] - 10https://gerrit.wikimedia.org/r/919061 [11:30:37] (03PS16) 10Jbond: firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) [11:31:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:31:31] (03CR) 10CI reject: [V: 04-1] WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (https://phabricator.wikimedia.org/T308002) (owner: 10Slyngshede) [11:32:19] (03PS33) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (https://phabricator.wikimedia.org/T308002) [11:32:31] (03PS1) 10Jbond: firewall: Surround srange with braces [puppet] - 10https://gerrit.wikimedia.org/r/925744 [11:34:13] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41499/console" [puppet] - 10https://gerrit.wikimedia.org/r/925744 (owner: 10Jbond) [11:34:21] (03CR) 10CI reject: [V: 04-1] firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [11:34:32] (03CR) 10Jbond: [V: 03+1 C: 03+2] firewall: Surround srange with braces [puppet] - 10https://gerrit.wikimedia.org/r/925744 (owner: 10Jbond) [11:34:37] (03CR) 10Jbond: [V: 03+2 C: 03+2] firewall: Surround srange with braces [puppet] - 10https://gerrit.wikimedia.org/r/925744 (owner: 10Jbond) [11:36:38] (03PS34) 10Slyngshede: P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (https://phabricator.wikimedia.org/T308002) [11:36:40] (03PS3) 10Dreamy Jazz: Always collapse by default the CheckUserHelper on loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886370 (https://phabricator.wikimedia.org/T328726) [11:37:05] (03CR) 10Slyngshede: "Enable OIDC authentication for netbox-next" [puppet] - 10https://gerrit.wikimedia.org/r/922506 (https://phabricator.wikimedia.org/T308002) (owner: 10Slyngshede) [11:37:31] (03CR) 10Jbond: [C: 03+2] ferm::service: allow passing array of hosts 
[puppet] - 10https://gerrit.wikimedia.org/r/919300 (owner: 10Majavah) [11:37:48] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:38:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:38:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T336886)', diff saved to https://phabricator.wikimedia.org/P48685 and previous config saved to /var/cache/conftool/dbconfig/20230601-113822-ladsgroup.json [11:38:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2124.codfw.wmnet with reason: Maintenance [11:38:25] 10SRE, 10All-and-every-Wikisource, 10Product-Analytics, 10SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (10SGill) @SCherukuwada - Additional context on this ticket. This issue has been reported multiple times in Wikisource Telegram group and also in the... 
[11:38:26] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [11:38:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2124.codfw.wmnet with reason: Maintenance [11:38:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T336886)', diff saved to https://phabricator.wikimedia.org/P48686 and previous config saved to /var/cache/conftool/dbconfig/20230601-113843-ladsgroup.json [11:39:01] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41500/console" [puppet] - 10https://gerrit.wikimedia.org/r/922506 (https://phabricator.wikimedia.org/T308002) (owner: 10Slyngshede) [11:39:10] (03CR) 10Jbond: [C: 03+2] P:cumin::cloud_targets: use array for srange [puppet] - 10https://gerrit.wikimedia.org/r/924909 (owner: 10Jbond) [11:41:35] (03CR) 10Jbond: [C: 03+2] "All merged and tested thanks for all the work" [puppet] - 10https://gerrit.wikimedia.org/r/924909 (owner: 10Jbond) [11:43:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T336886)', diff saved to https://phabricator.wikimedia.org/P48687 and previous config saved to /var/cache/conftool/dbconfig/20230601-114342-ladsgroup.json [11:43:46] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [11:46:26] (03CR) 10JMeybohm: [C: 03+1] miscweb: add bienvenida release to eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/925682 (https://phabricator.wikimedia.org/T337047) (owner: 10Jelto) [11:48:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs2021:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:49:48] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST deployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:53:22] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Remove profile::kubernetes::infrastructure_users [labs/private] - 10https://gerrit.wikimedia.org/r/925708 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [11:54:03] (KubernetesAPILatency) resolved: (10) High Kubernetes API latency (LIST clusterroles) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:54:48] (KubernetesAPILatency) firing: (11) High Kubernetes API latency (LIST clusterroles) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:57:49] (03CR) 10Jelto: [C: 03+2] miscweb: add bienvenida release to eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/925682 (https://phabricator.wikimedia.org/T337047) (owner: 10Jelto) [11:58:39] (03Merged) 10jenkins-bot: miscweb: add bienvenida release to eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/925682 (https://phabricator.wikimedia.org/T337047) (owner: 10Jelto) [11:58:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P48688 and previous config saved to /var/cache/conftool/dbconfig/20230601-115848-ladsgroup.json [11:59:03] (KubernetesAPILatency) resolved: (11) High Kubernetes API latency (LIST clusterroles) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:00:06] Daimona: Dear deployers, time to do the Create new table for the CampaignEvents extension deploy. 
Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230601T1200). [12:03:23] !log Creating ce_tracking_tools table for the CampaignEvents extension on testwiki, test2wiki, officewiki, and metawiki # T336365 [12:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:26] T336365: Create the ce_tracking_tools table in production - https://phabricator.wikimedia.org/T336365 [12:04:49] (03PS1) 10Ladsgroup: Repool half of wikireplicas, depool the other half [puppet] - 10https://gerrit.wikimedia.org/r/925750 (https://phabricator.wikimedia.org/T337734) [12:04:55] (03PS1) 10Jelto: miscweb: change path for probes [deployment-charts] - 10https://gerrit.wikimedia.org/r/925751 (https://phabricator.wikimedia.org/T337041) [12:05:42] (03PS2) 10Ladsgroup: Repool half of wikireplicas, depool the other half [puppet] - 10https://gerrit.wikimedia.org/r/925750 (https://phabricator.wikimedia.org/T337734) [12:05:48] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Repool half of wikireplicas, depool the other half [puppet] - 10https://gerrit.wikimedia.org/r/925750 (https://phabricator.wikimedia.org/T337734) (owner: 10Ladsgroup) [12:08:35] FTR, I'm done with T336365 [12:08:39] T336365: Create the ce_tracking_tools table in production - https://phabricator.wikimedia.org/T336365 [12:10:06] (03PS1) 10Slyngshede: C:idm::deployment rename OIDC service variable. 
[puppet] - 10https://gerrit.wikimedia.org/r/925758 [12:12:16] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 8 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:13:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P48689 and previous config saved to /var/cache/conftool/dbconfig/20230601-121354-ladsgroup.json [12:14:08] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41501/console" [puppet] - 10https://gerrit.wikimedia.org/r/925758 (owner: 10Slyngshede) [12:15:07] (03PS2) 10Jelto: miscweb: change path for readiness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/925751 (https://phabricator.wikimedia.org/T337041) [12:16:21] !log jelto@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [12:16:45] !log jelto@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [12:17:03] !log jelto@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:17:14] !log jelto@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[12:27:21] (03CR) 10Vgutierrez: [C: 03+1] "Approved by Kwaku on https://phabricator.wikimedia.org/T337911#8894944" [puppet] - 10https://gerrit.wikimedia.org/r/925670 (https://phabricator.wikimedia.org/T337911) (owner: 10Fabfur) [12:28:57] (03PS1) 10Arturo Borrero Gonzalez: wikimediacloud.org: codfw1dev: rework rabbitmq CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/925761 (https://phabricator.wikimedia.org/T336808) [12:29:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T336886)', diff saved to https://phabricator.wikimedia.org/P48690 and previous config saved to /var/cache/conftool/dbconfig/20230601-122900-ladsgroup.json [12:29:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance [12:29:04] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [12:29:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance [12:30:44] (03PS15) 10Muehlenhoff: Add a cookbook to drain a Ganeti node [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 (https://phabricator.wikimedia.org/T203964) [12:31:22] (03PS1) 10Arturo Borrero Gonzalez: openstack: codfw1dev: refresh rabbitmq nodes [puppet] - 10https://gerrit.wikimedia.org/r/925762 (https://phabricator.wikimedia.org/T336808) [12:32:15] (03CR) 10Muehlenhoff: Cloud: Remove support for Stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/925721 (owner: 10Muehlenhoff) [12:32:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2151.codfw.wmnet with reason: Maintenance [12:32:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2151.codfw.wmnet with reason: Maintenance [12:32:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2151 (T336886)', diff saved to 
https://phabricator.wikimedia.org/P48691 and previous config saved to /var/cache/conftool/dbconfig/20230601-123236-ladsgroup.json [12:33:00] (03CR) 10CI reject: [V: 04-1] Add a cookbook to drain a Ganeti node [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [12:37:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T336886)', diff saved to https://phabricator.wikimedia.org/P48692 and previous config saved to /var/cache/conftool/dbconfig/20230601-123720-ladsgroup.json [12:37:24] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [12:39:55] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/925656 (owner: 10Muehlenhoff) [12:40:17] (03PS16) 10Muehlenhoff: Add a cookbook to drain a Ganeti node [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 (https://phabricator.wikimedia.org/T203964) [12:42:18] (03CR) 10Muehlenhoff: swift: Remove stretch support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/925656 (owner: 10Muehlenhoff) [12:43:50] (03CR) 10MVernon: [C: 03+1] swift: Remove stretch support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/925656 (owner: 10Muehlenhoff) [12:44:26] (03PS1) 10Muehlenhoff: Switch debmonitor to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/925764 [12:44:56] (03PS11) 10Jbond: sre: update base class with an upgrade action [cookbooks] - 10https://gerrit.wikimedia.org/r/923670 [12:45:37] (03CR) 10Nskaggs: python: Replace --mount with --wsgi-file in webservice-runner (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/925099 (https://phabricator.wikimedia.org/T337897) (owner: 10BryanDavis) [12:46:15] (03CR) 10Alexandros Kosiaris: [C: 04-1] ipoid: Create iPoid chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/921700 (https://phabricator.wikimedia.org/T336163) (owner: 
10Effie Mouzeli) [12:47:48] (03CR) 10Muehlenhoff: [C: 03+2] Switch debmonitor to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/925764 (owner: 10Muehlenhoff) [12:48:35] (03CR) 10Andrew Bogott: [C: 03+1] wikimediacloud.org: codfw1dev: rework rabbitmq CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/925761 (https://phabricator.wikimedia.org/T336808) (owner: 10Arturo Borrero Gonzalez) [12:48:46] (03CR) 10Andrew Bogott: [C: 03+1] openstack: codfw1dev: refresh rabbitmq nodes [puppet] - 10https://gerrit.wikimedia.org/r/925762 (https://phabricator.wikimedia.org/T336808) (owner: 10Arturo Borrero Gonzalez) [12:49:26] (03CR) 10MVernon: [C: 03+1] swift: Remove stretch support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/925656 (owner: 10Muehlenhoff) [12:49:46] !log jelto@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [12:50:01] !log jelto@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [12:50:44] (03CR) 10Muehlenhoff: Add a cookbook to drain a Ganeti node (039 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [12:52:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P48693 and previous config saved to /var/cache/conftool/dbconfig/20230601-125226-ladsgroup.json [12:52:35] !log jelto@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [12:52:53] !log jelto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [12:54:45] (03PS1) 10Muehlenhoff: Switch mw canaries to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/925765 [12:57:29] (03CR) 10Jelto: [C: 03+2] httpbb: move tests for bienvenida.wikimedia.org to miscweb-k8s [puppet] - 10https://gerrit.wikimedia.org/r/923656 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [12:57:37] !log cgoubert@deploy1002 helmfile [codfw] START 
helmfile.d/services/mw-debug: apply [12:57:55] (03PS1) 10Daimona Eaytoy: beta: Stop setting unused $wgCampaignEventsUseNewTrackingToolsSchema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925766 (https://phabricator.wikimedia.org/T336362) [12:57:57] jouncebot: nowandnext [12:57:57] For the next 0 hour(s) and 2 minute(s): Create new table for the CampaignEvents extension (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230601T1200) [12:57:57] In 0 hour(s) and 2 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230601T1300) [12:57:57] In 0 hour(s) and 2 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230601T1300) [12:58:15] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230601T1300) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: (Dis)respected human, time to deploy UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230601T1300). Please do the needful. [13:00:05] Daimona: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:15] o/ [13:00:25] i can deploy today [13:00:26] hi Daimona [13:00:26] (in a meeting, any other deployers around?) 
[13:00:28] yay [13:00:33] enjoy your meeting TheresNoTime [13:00:42] Hi, and thank you :) [13:00:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925766 (https://phabricator.wikimedia.org/T336362) (owner: 10Daimona Eaytoy) [13:00:59] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923305 (https://phabricator.wikimedia.org/T336364) (owner: 10Daimona Eaytoy) [13:01:52] (03Merged) 10jenkins-bot: beta: Stop setting unused $wgCampaignEventsUseNewTrackingToolsSchema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925766 (https://phabricator.wikimedia.org/T336362) (owner: 10Daimona Eaytoy) [13:01:56] (03Merged) 10jenkins-bot: Set $wgCampaignEventsUseNewTrackingToolsSchema to true in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923305 (https://phabricator.wikimedia.org/T336364) (owner: 10Daimona Eaytoy) [13:02:23] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:925766|beta: Stop setting unused $wgCampaignEventsUseNewTrackingToolsSchema (T336362)]], [[gerrit:923305|Set $wgCampaignEventsUseNewTrackingToolsSchema to true in prod (T336364)]] [13:02:29] T336364: Update prod to use the new tracking tools schema - https://phabricator.wikimedia.org/T336364 [13:02:30] T336362: Update beta to use the new tracking tools schema - https://phabricator.wikimedia.org/T336362 [13:03:42] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 20 days, 0:00:00 on wdqs2021.codfw.wmnet with reason: attempting WDQS stack on bullseye [13:04:04] !log urbanecm@deploy1002 urbanecm and daimona: Backport for [[gerrit:925766|beta: Stop setting unused $wgCampaignEventsUseNewTrackingToolsSchema (T336362)]], [[gerrit:923305|Set $wgCampaignEventsUseNewTrackingToolsSchema to true in prod (T336364)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, 
mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [13:04:06] ^ inflatador thanks! [13:04:06] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20 days, 0:00:00 on wdqs2021.codfw.wmnet with reason: attempting WDQS stack on bullseye [13:04:14] Daimona: your patch is at mwdebug1002, can you test? [13:04:14] (03CR) 10TChin: Fix overlapping names edge case in flink-operator (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/925016 (https://phabricator.wikimedia.org/T336185) (owner: 10TChin) [13:04:18] (03CR) 10Jbond: "lgtm but id rename things to drop the cas_ prefix. we should use the cas_ prefix for when we are configuring CAS the protocol. for thing" [puppet] - 10https://gerrit.wikimedia.org/r/922506 (https://phabricator.wikimedia.org/T308002) (owner: 10Slyngshede) [13:04:25] Sure, giving it a try! [13:06:30] (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/925758 (owner: 10Slyngshede) [13:06:41] urbanecm: It's looking good to me, the patch should be a noop and I verified that nothing has exploded or caught fire [13:06:53] sounds good enough to me! 
proceeding :) [13:07:24] (03CR) 10Jbond: [C: 03+1] admin: Add fabfur to ops group [puppet] - 10https://gerrit.wikimedia.org/r/925670 (https://phabricator.wikimedia.org/T337911) (owner: 10Fabfur) [13:07:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P48694 and previous config saved to /var/cache/conftool/dbconfig/20230601-130732-ladsgroup.json [13:09:33] (03PS1) 10Muehlenhoff: Switch ganeti_test to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/925768 [13:10:22] (03CR) 10Jelto: [C: 03+1] "looks good after merging I1565bd33aa4fe59a4e25d27faf76042c1dd6deac" [puppet] - 10https://gerrit.wikimedia.org/r/923655 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [13:11:06] (03Abandoned) 10Muehlenhoff: Switch mw canaries to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/925765 (owner: 10Muehlenhoff) [13:13:31] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:925766|beta: Stop setting unused $wgCampaignEventsUseNewTrackingToolsSchema (T336362)]], [[gerrit:923305|Set $wgCampaignEventsUseNewTrackingToolsSchema to true in prod (T336364)]] (duration: 11m 08s) [13:13:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST replicasets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:13:35] T336364: Update prod to use the new tracking tools schema - https://phabricator.wikimedia.org/T336364 [13:13:36] T336362: Update beta to use the new tracking tools schema - https://phabricator.wikimedia.org/T336362 [13:13:36] Daimona: and it's live [13:13:37] anything else? [13:14:01] No, that should be all, thank you! I also have to run a schema change on beta as a consequence of that change, can I do that now? 
[13:15:21] (03CR) 10Vgutierrez: [C: 03+2] admin: Add fabfur to ops group [puppet] - 10https://gerrit.wikimedia.org/r/925670 (https://phabricator.wikimedia.org/T337911) (owner: 10Fabfur) [13:15:41] (03CR) 10Jbond: [C: 03+1] "lgtm adding Andrew to get +1 as well" [puppet] - 10https://gerrit.wikimedia.org/r/924982 (owner: 10Majavah) [13:15:43] (03CR) 10Muehlenhoff: [C: 03+2] Switch ganeti_test to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/925768 (owner: 10Muehlenhoff) [13:16:13] Daimona: beta should run schema changes automatically, assuming they're properly announced via onLoadExtensionSchemaUpdates. [13:16:41] They're not, because it uses the wikishared database and the SchemaUpdates hook doesn't really support that unfortunately... [13:17:28] ah, i see. let's run it manually then. [13:17:31] This is the task BTW https://phabricator.wikimedia.org/T336362 [13:17:32] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops group for fabfur - https://phabricator.wikimedia.org/T337911 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [13:17:38] Cool, I'm going to do that now :) [13:18:30] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/924981 (owner: 10Majavah) [13:18:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST replicasets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:19:03] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/924983 (owner: 10Majavah) [13:19:55] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:idm::deployment rename OIDC service variable. 
[puppet] - 10https://gerrit.wikimedia.org/r/925758 (owner: 10Slyngshede) [13:19:57] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/924984 (owner: 10Majavah) [13:21:26] Done, and verified that nothing broke [13:21:50] !log Removing obsolete mediawiki-services-function-orchestrator from registry - T337505 [13:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:53] T337505: Please hide from the docker registry two no-longer-used Abstract Wiki images (now moved to GitLab) - https://phabricator.wikimedia.org/T337505 [13:22:15] that's always great to hear :) [13:22:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T336886)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20230601-132238-ladsgroup.json [13:22:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2158.codfw.wmnet with reason: Maintenance [13:22:48] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [13:22:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2158.codfw.wmnet with reason: Maintenance [13:23:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [13:23:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [13:23:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T336886)', diff saved to https://phabricator.wikimedia.org/P48695 and previous config saved to /var/cache/conftool/dbconfig/20230601-132319-ladsgroup.json [13:24:36] (03CR) 10Muehlenhoff: "One comment inline, otherwise looks good. 
I've tested this with role:debmonitor and role::ganeti_test (so needs rebasing) and everything w" [puppet] - 10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [13:25:17] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/924985 (owner: 10Majavah) [13:26:02] (03PS1) 10Daimona Eaytoy: prod: Stop setting unused $wgCampaignEventsUseNewTrackingToolsSchema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925775 (https://phabricator.wikimedia.org/T336364) [13:26:46] (03CR) 10Daimona Eaytoy: [C: 04-1] "Should not be merged until wmf.12 (which includes the removal of the flag from the extension) reaches production." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925775 (https://phabricator.wikimedia.org/T336364) (owner: 10Daimona Eaytoy) [13:27:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T336886)', diff saved to https://phabricator.wikimedia.org/P48697 and previous config saved to /var/cache/conftool/dbconfig/20230601-132758-ladsgroup.json [13:28:02] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [13:29:46] !log installing openssl security updates on bullseye [13:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:35] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] signup:blocklist Expand blocklist feature [software/bitu] - 10https://gerrit.wikimedia.org/r/919005 (owner: 10Slyngshede) [13:35:56] (03PS1) 10Clément Goubert: mediawiki: Add terminationGracePeriodSeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/925776 (https://phabricator.wikimedia.org/T331609) [13:39:28] (03PS1) 10Fabfur: hiera: Swap port 80 from varnish to haproxy on all drmrs clusters [puppet] - 10https://gerrit.wikimedia.org/r/925779 (https://phabricator.wikimedia.org/T323557) [13:41:08] (03CR) 10JMeybohm: [C: 03+1] miscweb: change path for readiness probe 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/925751 (https://phabricator.wikimedia.org/T337041) (owner: 10Jelto) [13:41:18] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/925779 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [13:42:19] (03CR) 10Jbond: [C: 03+1] "yes this make sense" [puppet] - 10https://gerrit.wikimedia.org/r/924525 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [13:43:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P48698 and previous config saved to /var/cache/conftool/dbconfig/20230601-134304-ladsgroup.json [13:43:35] (03CR) 10Jelto: [C: 03+2] miscweb: change path for readiness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/925751 (https://phabricator.wikimedia.org/T337041) (owner: 10Jelto) [13:44:26] (03Merged) 10jenkins-bot: miscweb: change path for readiness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/925751 (https://phabricator.wikimedia.org/T337041) (owner: 10Jelto) [13:44:39] 10SRE, 10serviceops-radar: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 (10Joe) I want to add a bit of context given I'm about to go on PTO, so that others can pick up this work. Thanks @MoritzMuehlenhoff for your work up to this point. Since the last ICU transition happened, T26... [13:44:55] 10SRE, 10serviceops-radar: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 (10Joe) [13:45:56] 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2004-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T337828 (10Jhancock.wm) @cmooney That was my bad. didn't get it seated all the way. You should be good now! 
[13:46:27] (03PS1) 10Filippo Giunchedi: prometheus: drop k8s pods-related metrics from cadvisor in 'ops' [puppet] - 10https://gerrit.wikimedia.org/r/925781 (https://phabricator.wikimedia.org/T337856) [13:47:48] (03PS2) 10Fabfur: hiera: Swap port 80 from varnish to haproxy on drmrs caching clusters [puppet] - 10https://gerrit.wikimedia.org/r/925779 (https://phabricator.wikimedia.org/T323557) [13:49:11] !log jelto@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [13:49:32] !log jelto@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [13:50:26] !log jelto@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [13:50:47] !log jelto@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [13:51:40] (03PS1) 10MVernon: swift: remove support for pre-bullseye [puppet] - 10https://gerrit.wikimedia.org/r/925807 (https://phabricator.wikimedia.org/T279637) [13:51:55] (03CR) 10Vgutierrez: [C: 03+1] "LGTM: don't forget to disable puppet on A:cp-drmrs before merging this CR" [puppet] - 10https://gerrit.wikimedia.org/r/925779 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [13:51:56] !log jelto@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [13:52:07] (03CR) 10CI reject: [V: 04-1] swift: remove support for pre-bullseye [puppet] - 10https://gerrit.wikimedia.org/r/925807 (https://phabricator.wikimedia.org/T279637) (owner: 10MVernon) [13:52:28] !log jelto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [13:52:56] !log installing sysstat security updates [13:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:24] (03PS2) 10MVernon: swift: remove support for pre-bullseye [puppet] - 10https://gerrit.wikimedia.org/r/925807 (https://phabricator.wikimedia.org/T279637) [13:55:36] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T337705 (10Jhancock.wm) 05Open→03Resolved known issue. 
resolving [13:58:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P48699 and previous config saved to /var/cache/conftool/dbconfig/20230601-135811-ladsgroup.json [13:58:56] (03CR) 10Muehlenhoff: [C: 03+2] swift: Remove stretch support [puppet] - 10https://gerrit.wikimedia.org/r/925656 (owner: 10Muehlenhoff) [14:00:51] (03PS7) 10Effie Mouzeli: ipoid: Create iPoid chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/921700 (https://phabricator.wikimedia.org/T336163) [14:01:39] (03CR) 10CI reject: [V: 04-1] ipoid: Create iPoid chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/921700 (https://phabricator.wikimedia.org/T336163) (owner: 10Effie Mouzeli) [14:03:12] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/925807 (https://phabricator.wikimedia.org/T279637) (owner: 10MVernon) [14:04:07] 10ops-knams, 10DC-Ops, 10Infrastructure-Foundations, 10Traffic, 10netops: Q4/Q1:knams racking elevations & planning - https://phabricator.wikimedia.org/T331886 (10cmooney) >>! In T331886#8688489, @RobH wrote: >>>! In T331886#8688402, @ayounsi wrote: >> Ideally we should also have 1 patch panel per rack.... 
[14:04:24] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/925807 (https://phabricator.wikimedia.org/T279637) (owner: 10MVernon) [14:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:11:09] !log Removing obsolete mediawiki-services-function-evaluator from registry - T337505 [14:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:12] T337505: Delete obsolete Abstract Wiki images mediawiki-services-function-orchestrator and mediawiki-services-function-evaluator from registry - https://phabricator.wikimedia.org/T337505 [14:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:13:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T336886)', diff saved to https://phabricator.wikimedia.org/P48700 and previous config saved to /var/cache/conftool/dbconfig/20230601-141317-ladsgroup.json [14:13:20] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [14:13:21] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@3c9cc85]: (no justification provided) [14:13:32] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@3c9cc85]: (no justification provided) (duration: 00m 11s) [14:14:12] !log Disabled puppet on A:cp-drmrs for T323557 [14:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:14] T323557: Let HAProxy handle port 80 - https://phabricator.wikimedia.org/T323557 [14:14:29] (03PS1) 
10Herron: mwlog1002: add python exemption [puppet] - 10https://gerrit.wikimedia.org/r/925813 (https://phabricator.wikimedia.org/T333614) [14:15:52] (03PS4) 10Jbond: puppet-merge: implement Lock out, tag out [puppet] - 10https://gerrit.wikimedia.org/r/922915 (https://phabricator.wikimedia.org/T248872) [14:16:17] !log herron@cumin1001 START - Cookbook sre.hosts.reimage for host mwlog1002.eqiad.wmnet with OS bullseye [14:16:27] (03CR) 10CI reject: [V: 04-1] puppet-merge: implement Lock out, tag out [puppet] - 10https://gerrit.wikimedia.org/r/922915 (https://phabricator.wikimedia.org/T248872) (owner: 10Jbond) [14:16:30] (03CR) 10Fabfur: [C: 03+2] "Ready for merging" [puppet] - 10https://gerrit.wikimedia.org/r/925779 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [14:16:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:54] (03CR) 10Jbond: "updated" [puppet] - 10https://gerrit.wikimedia.org/r/922915 (https://phabricator.wikimedia.org/T248872) (owner: 10Jbond) [14:17:53] (03PS5) 10Jbond: puppet-merge: implement Lock out, tag out [puppet] - 10https://gerrit.wikimedia.org/r/922915 (https://phabricator.wikimedia.org/T248872) [14:18:03] (03PS1) 10Ayounsi: device validator: fix bug where no asset_tag is defined [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/925817 [14:19:29] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/925817 (owner: 10Ayounsi) [14:25:14] (03CR) 10Ayounsi: [C: 03+2] device validator: fix bug where no asset_tag is defined [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/925817 (owner: 10Ayounsi) [14:29:34] !log installing imagemagick security updates on buster [14:29:35] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:54] (03CR) 10Jforrester: [C: 03+1] releases: clone repos/releng/release from gitlab [puppet] - 10https://gerrit.wikimedia.org/r/925033 (https://phabricator.wikimedia.org/T290260) (owner: 10Reedy) [14:33:33] (03CR) 10Herron: [C: 03+2] mwlog1002: add python exemption [puppet] - 10https://gerrit.wikimedia.org/r/925813 (https://phabricator.wikimedia.org/T333614) (owner: 10Herron) [14:34:49] !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye [14:34:56] 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2004-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T337828 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet wi... [14:36:09] !log fabfur@cumin1001 START - Cookbook sre.cdn.run-puppet-restart-varnish rolling custom on A:cp-text_drmrs and A:cp [14:39:33] !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [14:39:54] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [14:40:37] !log fabfur@cumin1001 START - Cookbook sre.cdn.run-puppet-restart-varnish rolling custom on A:cp-upload_drmrs and A:cp [14:41:36] !log herron@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mwlog1002.eqiad.wmnet with reason: host reimage [14:41:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wikimediacloud.org: codfw1dev: rework rabbitmq CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/925761 (https://phabricator.wikimedia.org/T336808) (owner: 10Arturo Borrero Gonzalez) [14:43:31] (03Abandoned) 10Arturo Borrero Gonzalez: wikimediacloud.org: move openstack.codfw1dev.wikimediacloud.org to new VIP [dns] - 10https://gerrit.wikimedia.org/r/918525 (https://phabricator.wikimedia.org/T332153) (owner: 
10Arturo Borrero Gonzalez) [14:44:14] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: codfw1dev: refresh rabbitmq nodes [puppet] - 10https://gerrit.wikimedia.org/r/925762 (https://phabricator.wikimedia.org/T336808) (owner: 10Arturo Borrero Gonzalez) [14:44:48] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mwlog1002.eqiad.wmnet with reason: host reimage [14:45:53] !log running run-puppet-agent on cp6009.drmrs.wmnet to fix icinga check from cookbook [14:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:22] (03CR) 10Ahmon Dancy: [C: 03+1] "Thanks Dzahn!" [puppet] - 10https://gerrit.wikimedia.org/r/925033 (https://phabricator.wikimedia.org/T290260) (owner: 10Reedy) [14:48:17] (03PS1) 10Ottomata: mw-page-content-change-enrich - code comment correction [deployment-charts] - 10https://gerrit.wikimedia.org/r/925821 [14:49:44] !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [14:53:09] !log installing jackson-databind security updates [14:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:23] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudswift1001.eqiad.wmnet with OS bullseye [14:55:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with... 
[14:55:54] !log aborrero@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye [14:56:01] 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2004-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T337828 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet with O... [14:56:16] !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye [14:56:22] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [14:56:24] 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2004-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T337828 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet wi... 
[14:59:18] !log installing python-sqlparse security updates [14:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:48] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mwlog1002.eqiad.wmnet with OS bullseye [15:05:11] (03PS1) 10Muehlenhoff: Cloud VPS: Remove support for stretch in various roles [puppet] - 10https://gerrit.wikimedia.org/r/925831 [15:11:12] RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard1003 is OK: HTTP OK: HTTP/1.1 200 OK - 7688 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [15:11:17] 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2004-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T337828 (10aborrero) In the debian installer, I'm getting a failure related to the RAID: {F37088667} Upon investigation I found that: ` Jun 1... [15:11:53] !log reprepro -C component/pybal bullseye-wikimedia pybal_1.15.13_source.changes [15:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:56] mutante: Is it expected that the host key of mwlog1002.eqiad.wmnet changed? [15:12:19] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (10MoritzMuehlenhoff) [15:14:15] Nevermind. I just ran wmf-update-known-hosts-production and I'm back in action. 
[15:15:19] dancy: that was me, its expected (host reimaged) [15:15:32] !log lvs400[89]: upgrade pybal to 1.15.13 - T334703 [15:15:36] 👍🏾 [15:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:37] T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 [15:17:53] 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2004-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T337828 (10aborrero) More data. The disk seems detected by the installer at boot: ` # grep -i sda /var/log/syslog 17:14 Jun 1 15:01:49... [15:19:55] 10SRE, 10Security-Team, 10WMF-General-or-Unknown, 10Wikimedia-Apache-configuration, and 2 others: Add security.txt to Wikimedia sites? (2023 edition) - https://phabricator.wikimedia.org/T337949 (10Reedy) [15:21:20] !log running run-puppet-agent on cp6010.drmrs.wmnet to fix icinga check from cookbook [15:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:05] (03CR) 10BryanDavis: python: Replace --mount with --wsgi-file in webservice-runner (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/925099 (https://phabricator.wikimedia.org/T337897) (owner: 10BryanDavis) [15:26:13] 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2004-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T337828 (10cmooney) It's odd, fdisk only detects two drives, but they are **sdb** and **sdc**: ` /var/log # fdisk -l Disk /dev/sdb: 1.75 TiB, 192... 
[15:31:58] 10SRE, 10ops-codfw: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (10Reedy) [15:31:59] (03CR) 10AikoChou: Declare mediawiki.page_outlink_topic_prediction_change.v1 stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923571 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [15:33:43] !log aborrero@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye [15:33:50] 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2004-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T337828 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet with O... [15:33:55] !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye [15:34:01] 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2004-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T337828 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet wi... [15:35:08] 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2004-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T337828 (10cmooney) I believe based on [[ https://www.dell.com/community/PowerEdge-OS-Forum/T320-How-to-get-rid-of-OS-deployment-driver-volumes/td... 
[15:35:35] (03PS1) 10Elukey: kserve-inference: use dict instead of lists for inference_services [deployment-charts] - 10https://gerrit.wikimedia.org/r/925844 [15:36:36] (03PS1) 10Jbond: puppetboard-next: add a new name for the puppet7 migration [dns] - 10https://gerrit.wikimedia.org/r/925845 (https://phabricator.wikimedia.org/T330490) [15:36:52] (03PS1) 10Jbond: service::catalog: Add puppetboard-next service for puppet7 migration [puppet] - 10https://gerrit.wikimedia.org/r/925846 (https://phabricator.wikimedia.org/T330490) [15:37:19] (03CR) 10CI reject: [V: 04-1] service::catalog: Add puppetboard-next service for puppet7 migration [puppet] - 10https://gerrit.wikimedia.org/r/925846 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [15:39:53] (03PS2) 10Elukey: kserve-inference: use dict instead of lists for inference_services [deployment-charts] - 10https://gerrit.wikimedia.org/r/925844 [15:43:31] (03PS1) 10Jbond: puppetboard::bookworm: update discovery name [puppet] - 10https://gerrit.wikimedia.org/r/925847 (https://phabricator.wikimedia.org/T330490) [15:43:33] (03PS1) 10Cwhite: opensearch_dashboards: fix package name typo [puppet] - 10https://gerrit.wikimedia.org/r/925113 (https://phabricator.wikimedia.org/T320620) [15:43:56] (03PS3) 10Elukey: kserve-inference: use dict instead of lists for inference_services [deployment-charts] - 10https://gerrit.wikimedia.org/r/925844 [15:43:59] (03CR) 10Jbond: [C: 03+2] puppetboard::bookworm: update discovery name [puppet] - 10https://gerrit.wikimedia.org/r/925847 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [15:44:48] !log aborrero@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye [15:44:56] 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2004-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T337828 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by 
aborrero@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet with O... [15:45:18] !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye [15:45:26] 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2004-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T337828 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet wi... [15:47:05] (03CR) 10Cwhite: [C: 03+2] opensearch_dashboards: fix package name typo [puppet] - 10https://gerrit.wikimedia.org/r/925113 (https://phabricator.wikimedia.org/T320620) (owner: 10Cwhite) [15:50:55] (03PS8) 10Effie Mouzeli: ipoid: Create iPoid chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/921700 (https://phabricator.wikimedia.org/T336163) [15:57:02] (03PS3) 10Effie Mouzeli: ipoid: add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/921707 (https://phabricator.wikimedia.org/T336163) [15:57:09] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudswift1001.eqiad.wmnet with OS bullseye [15:57:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS... 
[15:57:40] (03CR) 10CI reject: [V: 04-1] ipoid: add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/921707 (https://phabricator.wikimedia.org/T336163) (owner: 10Effie Mouzeli) [15:57:46] (03PS1) 10Papaul: fix typo for cloudswift100[1-2] in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/925850 (https://phabricator.wikimedia.org/T289882) [15:58:47] (03CR) 10Papaul: [C: 03+2] fix typo for cloudswift100[1-2] in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/925850 (https://phabricator.wikimedia.org/T289882) (owner: 10Papaul) [15:59:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudswift1001.eqiad.wmnet with OS bullseye [15:59:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with... [16:00:05] jbond and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230601T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:11] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudswift1001.eqiad.wmnet with OS bullseye [16:00:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS... 
[16:01:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudswift1001.eqiad.wmnet with OS bullseye [16:01:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with... [16:02:36] (03PS1) 10JMeybohm: kubernetes: Set default service_node_port_range to 30000-32767 [puppet] - 10https://gerrit.wikimedia.org/r/925851 (https://phabricator.wikimedia.org/T328291) [16:04:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudswift1001.eqiad.wmnet with reason: host reimage [16:05:42] (03PS2) 10Jbond: service::catalog: Add puppetboard-next service for puppet7 migration [puppet] - 10https://gerrit.wikimedia.org/r/925846 (https://phabricator.wikimedia.org/T330490) [16:06:00] !log gerrit - set repo wikimedia/annualreport to readonly (from active) - T337041 [16:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:03] T337041: move micro site annual.wikimedia.org and 15.wikipedia.org to kubernetes - https://phabricator.wikimedia.org/T337041 [16:07:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudswift1001.eqiad.wmnet with reason: host reimage [16:07:22] !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2004-dev.codfw.wmnet with reason: host reimage [16:08:11] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41504/console" [puppet] - 10https://gerrit.wikimedia.org/r/925851 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [16:09:23] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM overall, one stylistic preference you can safely ignore in the comment 😊" [puppet] - 
10https://gerrit.wikimedia.org/r/924593 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack) [16:09:49] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] kubernetes: Set default service_node_port_range to 30000-32767 [puppet] - 10https://gerrit.wikimedia.org/r/925851 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [16:10:16] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "nicely done!" [puppet] - 10https://gerrit.wikimedia.org/r/924596 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack) [16:10:36] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2004-dev.codfw.wmnet with reason: host reimage [16:14:09] (03PS6) 10Jbond: puppet-merge: implement Lock out, tag out [puppet] - 10https://gerrit.wikimedia.org/r/922915 (https://phabricator.wikimedia.org/T248872) [16:16:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) @Jclark-ctr let me know when it might suit to try and get more of these moves done. Thanks. 
[16:17:45] (03CR) 10Effie Mouzeli: [C: 03+2] ipoid: Create iPoid chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/921700 (https://phabricator.wikimedia.org/T336163) (owner: 10Effie Mouzeli) [16:22:10] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:22:30] (03PS9) 10Effie Mouzeli: ipoid: Create iPoid chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/921700 (https://phabricator.wikimedia.org/T336163) [16:23:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:23:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudswift1001.eqiad.wmnet with OS bullseye [16:24:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS... 
[16:27:49] (03PS2) 10Ottomata: EventStreamConfig - page_change - Remove unused streams and settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924998 (https://phabricator.wikimedia.org/T336817) [16:28:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10Jhancock.wm) [16:29:11] (03CR) 10Ottomata: [C: 03+2] EventStreamConfig - page_change - Remove unused streams and settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924998 (https://phabricator.wikimedia.org/T336817) (owner: 10Ottomata) [16:30:07] (03Merged) 10jenkins-bot: EventStreamConfig - page_change - Remove unused streams and settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924998 (https://phabricator.wikimedia.org/T336817) (owner: 10Ottomata) [16:30:30] (03PS1) 10AikoChou: changeprop: allow match_not in match_config for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/925852 (https://phabricator.wikimedia.org/T328899) [16:30:58] (03PS12) 10BCornwall: Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) [16:31:00] (03PS5) 10BCornwall: sre.cdn: move common functions to base class [cookbooks] - 10https://gerrit.wikimedia.org/r/923662 (owner: 10Jbond) [16:31:19] (03PS4) 10Effie Mouzeli: ipoid: add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/921707 (https://phabricator.wikimedia.org/T336163) [16:31:21] (03CR) 10CI reject: [V: 04-1] changeprop: allow match_not in match_config for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/925852 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [16:31:45] (03CR) 10Dzahn: "I feel like I should add background: The situation here is as such:" [puppet] - 10https://gerrit.wikimedia.org/r/924604 (owner: 10Dzahn) [16:32:29] !log lvs400[89]: upgrade pybal to 
1.15.13 - T334703 (round 2!) [16:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:33] T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 [16:33:15] (03CR) 10CI reject: [V: 04-1] sre.cdn: move common functions to base class [cookbooks] - 10https://gerrit.wikimedia.org/r/923662 (owner: 10Jbond) [16:33:39] (03CR) 10CI reject: [V: 04-1] Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [16:35:16] (03PS1) 10Cwhite: opensearch_dashboards: remove alerting and observability plugins [puppet] - 10https://gerrit.wikimedia.org/r/925114 (https://phabricator.wikimedia.org/T333732) [16:35:17] !log lvs5* (eqsin): upgrade pybal to 1.15.13 - T334703 [16:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:21] (03CR) 10Dzahn: [C: 03+1] miscweb: change path for readiness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/925751 (https://phabricator.wikimedia.org/T337041) (owner: 10Jelto) [16:37:25] (03CR) 10Dzahn: "are there actually no more strech VMs in cloud?" [puppet] - 10https://gerrit.wikimedia.org/r/925831 (owner: 10Muehlenhoff) [16:38:06] (03CR) 10Dzahn: "+1 for the "wikistats" part." 
[puppet] - 10https://gerrit.wikimedia.org/r/925831 (owner: 10Muehlenhoff) [16:38:20] (03Abandoned) 10EoghanGaffney: Change doc hosts to use rsync::quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/920310 (https://phabricator.wikimedia.org/T333945) (owner: 10EoghanGaffney) [16:40:08] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudswift1002.eqiad.wmnet with OS bullseye [16:40:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1002.eqiad.wmnet with... [16:40:22] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudswift1002.eqiad.wmnet with OS bullseye [16:40:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1002.eqiad.wmnet with OS... 
[16:42:30] !log lvs2* (codfw): upgrade pybal to 1.15.13 - T334703 [16:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:33] T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 [16:43:01] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/925845 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [16:43:29] (03CR) 10Effie Mouzeli: ipoid: Create iPoid chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/921700 (https://phabricator.wikimedia.org/T336163) (owner: 10Effie Mouzeli) [16:43:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:43:45] (03CR) 10Muehlenhoff: Cloud VPS: Remove support for stretch in various roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/925831 (owner: 10Muehlenhoff) [16:44:56] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: no-op: Remove unneeded wgEventBusStreamNamesMap override setting - T336817 (duration: 08m 18s) [16:44:59] T336817: Release mediawiki.page_change.v1 stream - https://phabricator.wikimedia.org/T336817 [16:47:27] (03PS1) 10Ottomata: Revert "EventStreamConfig - page_change - Remove unused streams and settings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925791 [16:47:34] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Revert "EventStreamConfig - page_change - Remove unused streams and settings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925791 (owner: 10Ottomata) [16:48:34] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (PATCH nodes) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes -
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:52:37] !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - aborrero@cumin2002" [16:53:39] !log aborrero@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - aborrero@cumin2002" [16:53:39] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye [16:53:47] 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2004-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T337828 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet with O... [16:55:30] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudswift1002.eqiad.wmnet with OS bullseye [16:55:36] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: revert: Remove unneeded wgEventBusStreamNamesMap override setting. Recent EventBus changes are not deployed yet? - T336817 (duration: 07m 24s) [16:55:39] T336817: Release mediawiki.page_change.v1 stream - https://phabricator.wikimedia.org/T336817 [16:55:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1002.eqiad.wmnet with...
[16:55:43] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudswift1002.eqiad.wmnet with OS bullseye [16:55:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1002.eqiad.wmnet with OS... [16:59:40] 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2004-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T337828 (10aborrero) 05Open→03Resolved [17:00:06] bd808: I, the Bot under the Fountain, call upon thee, The Deployer, to do Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230601T1700). [17:00:06] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230601T1700) [17:01:36] (03CR) 10Nskaggs: [C: 03+1] python: Replace --mount with --wsgi-file in webservice-runner (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/925099 (https://phabricator.wikimedia.org/T337897) (owner: 10BryanDavis) [17:01:56] (03CR) 10Nskaggs: [C: 03+1] python: Replace --mount with --wsgi-file in webservice-runner (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/925099 (https://phabricator.wikimedia.org/T337897) (owner: 10BryanDavis) [17:05:27] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudswift1002.eqiad.wmnet with OS bullseye [17:05:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host 
cloudswift1002.eqiad.wmnet with... [17:05:40] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudswift1002.eqiad.wmnet with OS bullseye [17:05:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1002.eqiad.wmnet with OS... [17:07:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10Jhancock.wm) [17:12:45] (03PS1) 10Esanders: Remove deleted config wgVectorStickyHeaderEdit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925858 (https://phabricator.wikimedia.org/T337955) [17:23:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10Jhancock.wm) @Jclark-ctr when you have a moment/back can you swap the ports on the NIC? thanks! 
[17:38:16] (03PS1) 10Hnowlan: device-analyics: deploy new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/925859 (https://phabricator.wikimedia.org/T320967) [17:40:22] (03CR) 10Hnowlan: [C: 03+2] device-analyics: deploy new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/925859 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [17:41:13] (03Merged) 10jenkins-bot: device-analyics: deploy new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/925859 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [17:42:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:45:23] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/device-analytics: apply [17:45:55] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [17:47:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:47:04] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/device-analytics: apply [17:47:35] (03PS2) 10ArielGlenn: fix up regex comparisons in dumps nfs share testing script [puppet] - 10https://gerrit.wikimedia.org/r/924874 (https://phabricator.wikimedia.org/T325232) [17:47:36] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply [17:48:25] (03PS1) 10JHathaway: expand_path, regex_data: use yaml safe_load when available [puppet] - 10https://gerrit.wikimedia.org/r/925862 
(https://phabricator.wikimedia.org/T330495) [17:48:28] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.run-puppet-restart-varnish (exit_code=0) rolling custom on A:cp-text_drmrs and A:cp [17:48:53] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/device-analytics: apply [17:49:32] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply [17:50:33] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@03ca1c1]: (no justification provided) [17:50:35] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.run-puppet-restart-varnish (exit_code=0) rolling custom on A:cp-upload_drmrs and A:cp [17:50:44] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@03ca1c1]: (no justification provided) (duration: 00m 10s) [17:52:05] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/925862 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway) [17:56:01] 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10Jdforrester-WMF) [18:00:06] dduvall and ^demon: That opportune time is upon us again. Time for a MediaWiki train - Utc-7 Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230601T1800). [18:02:24] (03CR) 10Hokwelum: [C: 03+1] "looks good! Thank you :-)" [puppet] - 10https://gerrit.wikimedia.org/r/924874 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [18:08:37] (03CR) 10Hokwelum: [C: 03+1] "looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/924887 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [18:12:01] 10SRE, 10Release-Engineering-Team, 10Security-Team, 10Wikimedia-GitHub, and 3 others: Add github.com/wikimedia as an SCM for Semgrep Cloud - https://phabricator.wikimedia.org/T337561 (10sbassett) 05Open→03Declined I'm going to decline this for now as @bcampbell and I set up Wikimedia's Okta as an SSO p... [18:20:29] (03PS1) 10JHathaway: java: ensure wmf-certificates is installed, when required [puppet] - 10https://gerrit.wikimedia.org/r/925873 (https://phabricator.wikimedia.org/T337972) [18:21:12] (03PS2) 10JHathaway: expand_path, regex_data: use yaml safe_load when available [puppet] - 10https://gerrit.wikimedia.org/r/925862 (https://phabricator.wikimedia.org/T330495) [18:21:53] (03PS3) 10JHathaway: expand_path, regex_data: use yaml safe_load when available [puppet] - 10https://gerrit.wikimedia.org/r/925862 (https://phabricator.wikimedia.org/T337972) [18:22:17] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/925862 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [18:22:51] (03CR) 10CI reject: [V: 04-1] java: ensure wmf-certificates is installed, when required [puppet] - 10https://gerrit.wikimedia.org/r/925873 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [18:23:48] (03PS1) 10TrainBranchBot: group2 wikis to 1.41.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925874 (https://phabricator.wikimedia.org/T337525) [18:23:50] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.41.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925874 (https://phabricator.wikimedia.org/T337525) (owner: 10TrainBranchBot) [18:24:33] (03Merged) 10jenkins-bot: group2 wikis to 1.41.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925874 (https://phabricator.wikimedia.org/T337525) (owner: 10TrainBranchBot) [18:27:42] (03PS1) 10Kimberly Sarabia: Remove 
config and AB test code for edit buttons in sticky header [skins/Vector] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/925792 (https://phabricator.wikimedia.org/T337955) [18:29:26] (03PS6) 10Eevans: hieradata: upgrade cassandra-dev2002 to Cassandra 4.1 [puppet] - 10https://gerrit.wikimedia.org/r/924611 (https://phabricator.wikimedia.org/T313814) [18:29:28] (03PS6) 10Eevans: hieradata: upgrade cassandra-dev2003 to Cassandra 4.1 [puppet] - 10https://gerrit.wikimedia.org/r/924612 (https://phabricator.wikimedia.org/T313814) [18:32:09] (03PS2) 10JHathaway: java: ensure wmf-certificates is installed, when required [puppet] - 10https://gerrit.wikimedia.org/r/925873 (https://phabricator.wikimedia.org/T337972) [18:32:10] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.11 refs T337525 [18:32:14] T337525: 1.41.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T337525 [18:33:29] !log lvs3* (esams): upgrade pybal to 1.15.13 - T334703 [18:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:32] T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 [18:33:54] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:34:04] (03PS7) 10Eevans: hieradata: upgrade cassandra-dev2002 to Cassandra 4.1 [puppet] - 10https://gerrit.wikimedia.org/r/924611 (https://phabricator.wikimedia.org/T313814) [18:34:06] (03PS7) 10Eevans: hieradata: upgrade cassandra-dev2003 to Cassandra 4.1 [puppet] - 10https://gerrit.wikimedia.org/r/924612 (https://phabricator.wikimedia.org/T313814) [18:34:29] (03PS1) 10JHathaway: ferm: Ensure iptables is installed before configuring alternatives [puppet] - 10https://gerrit.wikimedia.org/r/925877 (https://phabricator.wikimedia.org/T337972) [18:34:42] (03CR) 10Eevans: 
"check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/924611 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [18:35:19] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/925873 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [18:35:36] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [18:37:02] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [18:40:46] (03CR) 10Eevans: [C: 03+2] hieradata: upgrade cassandra-dev2002 to Cassandra 4.1 [puppet] - 10https://gerrit.wikimedia.org/r/924611 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [18:44:53] (03PS1) 10JHathaway: bookworm: Change to deb822 format for sources.list [puppet] - 10https://gerrit.wikimedia.org/r/925878 (https://phabricator.wikimedia.org/T330495) [18:45:27] !log lvs6* (drmrs): upgrade pybal to 1.15.13 - T334703 [18:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:30] T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 [18:45:37] (03CR) 10CI reject: [V: 04-1] Remove config and AB test code for edit buttons in sticky header [skins/Vector] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/925792 (https://phabricator.wikimedia.org/T337955) (owner: 10Kimberly Sarabia) [18:46:30] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/925877 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [18:46:46] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/925878 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway) [18:47:22] (03CR) 10Eevans: 
"check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/924612 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [18:51:53] jouncebot: nowandnext [18:51:53] For the next 1 hour(s) and 8 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230601T1800) [18:51:53] In 1 hour(s) and 8 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230601T2000) [18:52:06] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10GPSLeo) I also get many 5xx and stashfailed errors on my current uploads using python requests library.... [19:01:08] (03CR) 10Herron: [C: 03+1] prometheus: drop k8s pods-related metrics from cadvisor in 'ops' [puppet] - 10https://gerrit.wikimedia.org/r/925781 (https://phabricator.wikimedia.org/T337856) (owner: 10Filippo Giunchedi) [19:08:24] !log bblack@deploy1002 Locking from deployment [ALL REPOSITORIES]: temporary lock for LVS/pybal upgrade work [19:09:11] !log lvs1* (eqiad): upgrade pybal to 1.15.13 - T334703 [19:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:14] T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 [19:11:51] !log bblack@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: temporary lock for LVS/pybal upgrade work (duration: 03m 27s) [19:11:52] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@21e7354]: (no justification provided) [19:12:02] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@21e7354]: (no justification provided) (duration: 02m 42s) [19:12:21] (03PS1) 10JHathaway: puppetserver: hiera type defs [puppet] - 10https://gerrit.wikimedia.org/r/925893 
(https://phabricator.wikimedia.org/T337972) [19:17:17] (03CR) 10Eevans: [C: 03+2] hieradata: upgrade cassandra-dev2003 to Cassandra 4.1 [puppet] - 10https://gerrit.wikimedia.org/r/924612 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [19:21:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [19:31:39] (03PS1) 10Eevans: hieradata: move cassandra_dev per-host settings back to role [puppet] - 10https://gerrit.wikimedia.org/r/925912 (https://phabricator.wikimedia.org/T337344) [19:31:51] (03CR) 10Dzahn: [C: 03+1] Cloud VPS: Remove support for stretch in various roles [puppet] - 10https://gerrit.wikimedia.org/r/925831 (owner: 10Muehlenhoff) [19:33:29] (03CR) 10Eevans: [C: 03+2] hieradata: move cassandra_dev per-host settings back to role [puppet] - 10https://gerrit.wikimedia.org/r/925912 (https://phabricator.wikimedia.org/T337344) (owner: 10Eevans) [19:38:45] (03CR) 10Dzahn: [C: 03+2] trafficserver: switch bienvenida.wikimedia.org backend [puppet] - 10https://gerrit.wikimedia.org/r/925727 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [19:38:52] (03PS2) 10Dzahn: trafficserver: switch bienvenida.wikimedia.org backend [puppet] - 10https://gerrit.wikimedia.org/r/925727 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [19:44:36] (03PS1) 10Krinkle: webperf: Remove remnants of coal and coal-web [puppet] - 10https://gerrit.wikimedia.org/r/925918 (https://phabricator.wikimedia.org/T335242) [19:47:53] (03CR) 10Dzahn: "deployed to cp4* first, because ULSFO is my local DC. 
After running puppet on the 16 cp4 hosts I could see that there are no more new log " [puppet] - 10https://gerrit.wikimedia.org/r/925727 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [19:48:06] (03PS9) 10Krinkle: webperf: Fix /excimer/ POST restriction [puppet] - 10https://gerrit.wikimedia.org/r/919419 (https://phabricator.wikimedia.org/T291015) [19:49:23] (03CR) 10Dzahn: "just deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/925727 and things look fine for me on ULSFO. I am giving it a few hours" [puppet] - 10https://gerrit.wikimedia.org/r/923655 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [19:50:08] (03PS1) 10JHathaway: puppetserver: add additional config options [puppet] - 10https://gerrit.wikimedia.org/r/925919 (https://phabricator.wikimedia.org/T337972) [19:50:57] (03PS2) 10JHathaway: puppetserver: add additional config options [puppet] - 10https://gerrit.wikimedia.org/r/925919 (https://phabricator.wikimedia.org/T337972) [19:51:37] (03PS1) 10Cwhite: prometheus: re-enable swagger jobs from service catalog [puppet] - 10https://gerrit.wikimedia.org/r/925117 (https://phabricator.wikimedia.org/T320620) [19:51:43] (03CR) 10Dzahn: "[deploy1002:~] $ curl -I --resolve bienvenida.wikimedia.org:30443:$(dig +short k8s-ingress-wikikube.svc.eqiad.wmnet) https://bienvenida.wi" [puppet] - 10https://gerrit.wikimedia.org/r/923656 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [19:51:51] (03PS1) 10Eevans: cassandra_dev: make client encryption non-optional [puppet] - 10https://gerrit.wikimedia.org/r/925920 (https://phabricator.wikimedia.org/T337344) [19:53:19] (03PS2) 10Krinkle: webperf: Remove remnants of coal and coal-web [puppet] - 10https://gerrit.wikimedia.org/r/925918 (https://phabricator.wikimedia.org/T335242) [19:53:21] (03PS10) 10Krinkle: webperf: Fix /excimer/ POST restriction [puppet] - 10https://gerrit.wikimedia.org/r/919419 (https://phabricator.wikimedia.org/T291015) [19:53:33] (03PS1) 10Daimona Eaytoy: 
filtered_tables.txt: Update for CampaignEvents schema change [puppet] - 10https://gerrit.wikimedia.org/r/925921 (https://phabricator.wikimedia.org/T337940) [19:53:56] (03PS1) 10JHathaway: puppetserver: fix permadiff on $ssl_dir [puppet] - 10https://gerrit.wikimedia.org/r/925922 (https://phabricator.wikimedia.org/T337972) [19:53:58] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!!" [puppet] - 10https://gerrit.wikimedia.org/r/925781 (https://phabricator.wikimedia.org/T337856) (owner: 10Filippo Giunchedi) [19:54:36] (03PS3) 10Krinkle: webperf: Remove remnants of coal and coal-web [puppet] - 10https://gerrit.wikimedia.org/r/925918 (https://phabricator.wikimedia.org/T335242) [19:54:40] (03CR) 10Eevans: [C: 03+2] cassandra_dev: make client encryption non-optional [puppet] - 10https://gerrit.wikimedia.org/r/925920 (https://phabricator.wikimedia.org/T337344) (owner: 10Eevans) [19:54:44] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/925918 (https://phabricator.wikimedia.org/T335242) (owner: 10Krinkle) [19:54:56] (03PS11) 10Krinkle: webperf: Fix /excimer/ POST restriction [puppet] - 10https://gerrit.wikimedia.org/r/919419 (https://phabricator.wikimedia.org/T291015) [19:55:05] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/919419 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle) [19:55:10] (03PS12) 10Krinkle: webperf: Fix /excimer/ POST restriction [puppet] - 10https://gerrit.wikimedia.org/r/919419 (https://phabricator.wikimedia.org/T291015) [19:55:12] (03CR) 10CI reject: [V: 04-1] webperf: Fix /excimer/ POST restriction [puppet] - 10https://gerrit.wikimedia.org/r/919419 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle) [19:56:38] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/919419 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle) [19:56:40] (03CR) 10Dzahn: microsites: remove http blackbox monitor for 15.wikipedia.org (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/922918 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [19:57:49] (03CR) 10Cwhite: [C: 03+2] prometheus: re-enable swagger jobs from service catalog [puppet] - 10https://gerrit.wikimedia.org/r/925117 (https://phabricator.wikimedia.org/T320620) (owner: 10Cwhite) [19:59:54] (03PS5) 10DDesouza: Deploy Research Incentive survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/917863 (https://phabricator.wikimedia.org/T336092) [20:00:06] TheresNoTime: #bothumor I � Unicode. All rise for UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230601T2000). [20:00:06] Dreamy_Jazz and kimberly_sarabia: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:13] \o [20:00:18] hello [20:00:30] Unable to test the patch I'm requesting, as I don't have CU rights on loginwiki [20:00:33] I can deploy :) [20:01:42] (03PS4) 10Samtar: Always collapse by default the CheckUserHelper on loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886370 (https://phabricator.wikimedia.org/T328726) (owner: 10Dreamy Jazz) [20:02:54] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886370 (https://phabricator.wikimedia.org/T328726) (owner: 10Dreamy Jazz) [20:03:11] My patch was removed from the deployment page by another user by accident. Will re-add it. 
[20:03:20] ^^ [20:03:42] (03Merged) 10jenkins-bot: Always collapse by default the CheckUserHelper on loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886370 (https://phabricator.wikimedia.org/T328726) (owner: 10Dreamy Jazz) [20:03:59] !log samtar@deploy1002 Started scap: Backport for [[gerrit:886370|Always collapse by default the CheckUserHelper on loginwiki (T328726)]] [20:04:02] T328726: Allow the CheckUserHelper script to be always collapsed through a config - https://phabricator.wikimedia.org/T328726 [20:05:27] !log samtar@deploy1002 samtar and dreamyjazz: Backport for [[gerrit:886370|Always collapse by default the CheckUserHelper on loginwiki (T328726)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [20:05:32] urbanecm: ^ [20:05:47] Dreamy_Jazz: I've asked urbanecm to take a little look, what are they looking for? [20:06:06] When they run a check, the checkuser helper table should be collapsed by default [20:06:12] that is indeed the case [20:06:17] (03PS1) 10Eevans: cassandra_dev: set legacy_ssl_storage_port_enabled false [puppet] - 10https://gerrit.wikimedia.org/r/925932 (https://phabricator.wikimedia.org/T337344) [20:06:35] (I'm kind of surprised I have CU results for myself on loginwiki) [20:06:43] awesome, will sync [20:07:00] (03CR) 10Eevans: [C: 03+2] cassandra_dev: set legacy_ssl_storage_port_enabled false [puppet] - 10https://gerrit.wikimedia.org/r/925932 (https://phabricator.wikimedia.org/T337344) (owner: 10Eevans) [20:07:06] Thanks for testing urbanecm! [20:07:10] no problem! 
[20:07:39] (03PS6) 10Samtar: Deploy Research Incentive survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/917863 (https://phabricator.wikimedia.org/T336092) (owner: 10DDesouza) [20:09:03] (03CR) 10Samtar: "recheck" [skins/Vector] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/925792 (https://phabricator.wikimedia.org/T337955) (owner: 10Kimberly Sarabia) [20:09:33] (03PS1) 10JHathaway: add container facts [puppet] - 10https://gerrit.wikimedia.org/r/925935 (https://phabricator.wikimedia.org/T337972) [20:12:19] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:886370|Always collapse by default the CheckUserHelper on loginwiki (T328726)]] (duration: 08m 20s) [20:12:23] T328726: Allow the CheckUserHelper script to be always collapsed through a config - https://phabricator.wikimedia.org/T328726 [20:12:32] Dreamy_Jazz: live :) [20:12:42] danisztls: doing yours next [20:12:45] (03CR) 10CI reject: [V: 04-1] add container facts [puppet] - 10https://gerrit.wikimedia.org/r/925935 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [20:12:48] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/917863 (https://phabricator.wikimedia.org/T336092) (owner: 10DDesouza) [20:12:48] Thanks! 
[20:13:34] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:13:34] (03Merged) 10jenkins-bot: Deploy Research Incentive survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/917863 (https://phabricator.wikimedia.org/T336092) (owner: 10DDesouza) [20:13:52] !log samtar@deploy1002 Started scap: Backport for [[gerrit:917863|Deploy Research Incentive survey on enwiki (T336092)]] [20:13:55] T336092: Deploy Research Incentive Survey on English Wikipedia - https://phabricator.wikimedia.org/T336092 [20:14:52] (03PS11) 10BBlack: pybal: configure failover i13n IPs [puppet] - 10https://gerrit.wikimedia.org/r/924593 (https://phabricator.wikimedia.org/T334703) [20:14:54] (03PS5) 10BBlack: safe-service-restart: use failover i13n [puppet] - 10https://gerrit.wikimedia.org/r/924596 (https://phabricator.wikimedia.org/T334703) [20:15:19] !log samtar@deploy1002 dani and samtar: Backport for [[gerrit:917863|Deploy Research Incentive survey on enwiki (T336092)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:15:20] kimberly_sarabia: FYI, I triggered a `recheck` for 925792 as it failed CI [20:15:29] danisztls: live on mwdebug, can you test? 
[20:15:35] yup i saw [20:15:36] TheresNoTime: yes, thanks [20:15:55] TheresNoTime: looks good [20:15:59] (03CR) 10BBlack: pybal: configure failover i13n IPs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/924593 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack) [20:16:02] syncing [20:18:11] (03CR) 10BBlack: "New PCC output looks good for both changes: https://puppet-compiler.wmflabs.org/output/924596/41506/" [puppet] - 10https://gerrit.wikimedia.org/r/924596 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack) [20:18:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:18:48] !status Ok [20:19:23] in other channels like -cloud wm-bot would now change the topic, see there [20:19:36] probably lacking privs here [20:19:58] it's a neat feature though for here as well [20:20:25] sorry, not wm-bot, wmopbot. [20:20:53] > Flags for wmopbot in #wikimedia-operations are +Vov. [20:20:57] needs +t I think? 
[20:21:22] 19:59 < danilo> wmopbot, it can do that in any channel it has +t and the topic has a "status:" section
[20:21:25] ack
[20:21:49] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:917863|Deploy Research Incentive survey on enwiki (T336092)]] (duration: 07m 56s)
[20:21:52] T336092: Deploy Research Incentive Survey on English Wikipedia - https://phabricator.wikimedia.org/T336092
[20:21:54] danisztls: live on prod :)
[20:22:51] (PS1) Cwhite: prometheus: fix swagger job relabel configs [puppet] - https://gerrit.wikimedia.org/r/925118 (https://phabricator.wikimedia.org/T320620)
[20:25:21] (CR) Cwhite: [C: +2] prometheus: fix swagger job relabel configs [puppet] - https://gerrit.wikimedia.org/r/925118 (https://phabricator.wikimedia.org/T320620) (owner: Cwhite)
[20:26:30] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.11) - https://gerrit.wikimedia.org/r/925792 (https://phabricator.wikimedia.org/T337955) (owner: Kimberly Sarabia)
[20:27:50] (PS2) Samtar: Remove deleted config wgVectorStickyHeaderEdit [mediawiki-config] - https://gerrit.wikimedia.org/r/925858 (https://phabricator.wikimedia.org/T337955) (owner: Esanders)
[20:31:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[20:32:54] ^ was probably related to T337991
[20:32:55] T337991: API request failed (backend-fail-internal): An unknown error occurred in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T337991
[20:44:16] (Merged) jenkins-bot: Remove config and AB test code for edit buttons in sticky header [skins/Vector] (wmf/1.41.0-wmf.11) - https://gerrit.wikimedia.org/r/925792 (https://phabricator.wikimedia.org/T337955) (owner: Kimberly Sarabia)
[20:44:29] !log samtar@deploy1002 Started scap: Backport for [[gerrit:925792|Remove config and AB test code for edit buttons in sticky header (T337955)]]
[20:44:32] T337955: Edit buttons not appearing within sticky header - https://phabricator.wikimedia.org/T337955
[20:45:57] !log samtar@deploy1002 samtar and ksarabia: Backport for [[gerrit:925792|Remove config and AB test code for edit buttons in sticky header (T337955)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[20:46:15] kimberly_sarabia: that one is live on mwdebug, can you test it?
[20:46:22] yep one moment
[20:48:42] LGTM
[20:48:48] syncing
[20:54:58] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:925792|Remove config and AB test code for edit buttons in sticky header (T337955)]] (duration: 10m 29s)
[20:55:01] moving on to 925858
[20:55:02] T337955: Edit buttons not appearing within sticky header - https://phabricator.wikimedia.org/T337955
[20:55:43] (PS3) Samtar: Remove deleted config wgVectorStickyHeaderEdit [mediawiki-config] - https://gerrit.wikimedia.org/r/925858 (https://phabricator.wikimedia.org/T337955) (owner: Esanders)
[20:56:36] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/925858 (https://phabricator.wikimedia.org/T337955) (owner: Esanders)
[20:56:51] (CR) Volans: "The last PS has reverted some changes agreed in previous PSs" [cookbooks] - https://gerrit.wikimedia.org/r/923662 (owner: Jbond)
[20:57:20] (Merged) jenkins-bot: Remove deleted config wgVectorStickyHeaderEdit [mediawiki-config] - https://gerrit.wikimedia.org/r/925858 (https://phabricator.wikimedia.org/T337955) (owner: Esanders)
[20:57:36] !log samtar@deploy1002 Started scap: Backport for [[gerrit:925858|Remove deleted config wgVectorStickyHeaderEdit (T337955)]]
[20:58:19] SRE, All-and-every-Wikisource, Product-Analytics, SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (SCherukuwada) As far as I can tell, a lot of the pages that aren't appearing in the index are simply not linked to from within the Wiki. There are...
[20:59:15] !log samtar@deploy1002 esanders and samtar: Backport for [[gerrit:925858|Remove deleted config wgVectorStickyHeaderEdit (T337955)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[20:59:32] kimberly_sarabia: ^ live on mwdebug, can you test this change?
[20:59:47] yup
[21:00:30] LGTM
[21:00:38] syncing
[21:06:07] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:925858|Remove deleted config wgVectorStickyHeaderEdit (T337955)]] (duration: 08m 30s)
[21:06:10] T337955: Edit buttons not appearing within sticky header - https://phabricator.wikimedia.org/T337955
[21:06:13] and live :)
[21:09:35] tysm!
[21:22:20] (PS1) JHathaway: don't export resources when $::settings::storeconfigs is false [puppet] - https://gerrit.wikimedia.org/r/925968 (https://phabricator.wikimedia.org/T337972)
[21:23:25] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/925968 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[21:24:35] (CR) CI reject: [V: -1] don't export resources when $::settings::storeconfigs is false [puppet] - https://gerrit.wikimedia.org/r/925968 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[21:24:36] SRE, All-and-every-Wikisource, Product-Analytics, SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (Soda) >>! In T325607#8897400, @SCherukuwada wrote: > As far as I can tell, a lot of the pages that aren't appearing in the index are simply not lin...
[21:25:35] (PS2) JHathaway: add container facts [puppet] - https://gerrit.wikimedia.org/r/925935 (https://phabricator.wikimedia.org/T337972)
[21:25:38] (PS3) EoghanGaffney: releases: Add new hosts to failover servers list [puppet] - https://gerrit.wikimedia.org/r/924970 (https://phabricator.wikimedia.org/T334435)
[21:25:40] (PS1) EoghanGaffney: doc: Switch sync between nodes to rsync::quickdatacopy [puppet] - https://gerrit.wikimedia.org/r/925969 (https://phabricator.wikimedia.org/T333945)
[21:25:54] (PS2) EoghanGaffney: doc: Switch sync between nodes to rsync::quickdatacopy [puppet] - https://gerrit.wikimedia.org/r/925969 (https://phabricator.wikimedia.org/T333945)
[21:27:01] (PS13) Krinkle: webperf: Fix /excimer/ POST restriction [puppet] - https://gerrit.wikimedia.org/r/919419 (https://phabricator.wikimedia.org/T291015)
[21:29:41] (CR) JHathaway: "kindly review" [puppet] - https://gerrit.wikimedia.org/r/925873 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[21:29:47] (CR) JHathaway: "kindly review" [puppet] - https://gerrit.wikimedia.org/r/925862 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[21:29:55] (CR) JHathaway: "kindly review" [puppet] - https://gerrit.wikimedia.org/r/925893 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[21:30:04] (CR) JHathaway: "kindly review" [puppet] - https://gerrit.wikimedia.org/r/925877 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[21:30:09] (CR) JHathaway: "kindly review" [puppet] - https://gerrit.wikimedia.org/r/925919 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[21:30:14] (CR) JHathaway: "kindly review" [puppet] - https://gerrit.wikimedia.org/r/925922 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[21:30:30] (CR) JHathaway: "kindly review" [puppet] - https://gerrit.wikimedia.org/r/925878 (https://phabricator.wikimedia.org/T330495) (owner: JHathaway)
[21:30:44] (CR) JHathaway: "kindly review" [puppet] - https://gerrit.wikimedia.org/r/925935 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[21:31:08] (CR) EoghanGaffney: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41507/console" [puppet] - https://gerrit.wikimedia.org/r/925969 (https://phabricator.wikimedia.org/T333945) (owner: EoghanGaffney)
[21:31:24] (PS1) BryanDavis: phabricator: June 2023 pride month logo variant [puppet] - https://gerrit.wikimedia.org/r/925970 (https://phabricator.wikimedia.org/T337964)
[21:38:04] (CR) Krinkle: [C: -1] "This doesn't seem to work?" [puppet] - https://gerrit.wikimedia.org/r/919419 (https://phabricator.wikimedia.org/T291015) (owner: Krinkle)
[21:38:09] (CR) Krinkle: [C: -1] webperf: Fix /excimer/ POST restriction [puppet] - https://gerrit.wikimedia.org/r/919419 (https://phabricator.wikimedia.org/T291015) (owner: Krinkle)
[21:42:36] (CR) BryanDavis: "PCC shows this having no change at all to the catalogs on phab1004.eqiad.wmnet and phab2002.codfw.wmnet (CR) Brennen Bearnes: [C: -1] phabricator: June 2023 pride month logo variant (1 comment) [puppet] - https://gerrit.wikimedia.org/r/925970 (https://phabricator.wikimedia.org/T337964) (owner: BryanDavis)
[21:48:23] (PS3) EoghanGaffney: doc: Switch sync between nodes to rsync::quickdatacopy [puppet] - https://gerrit.wikimedia.org/r/925969 (https://phabricator.wikimedia.org/T333945)
[21:49:34] (PS2) JHathaway: don't export resources when $::settings::storeconfigs is false [puppet] - https://gerrit.wikimedia.org/r/925968 (https://phabricator.wikimedia.org/T337972)
[21:49:53] (CR) Brennen Bearnes: [C: -1] phabricator: June 2023 pride month logo variant (1 comment) [puppet] - https://gerrit.wikimedia.org/r/925970 (https://phabricator.wikimedia.org/T337964) (owner: BryanDavis)
[21:50:45] (CR) EoghanGaffney: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41509/console" [puppet] - https://gerrit.wikimedia.org/r/925969 (https://phabricator.wikimedia.org/T333945) (owner: EoghanGaffney)
[21:51:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[21:52:44] (CR) CI reject: [V: -1] don't export resources when $::settings::storeconfigs is false [puppet] - https://gerrit.wikimedia.org/r/925968 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[21:55:10] (Abandoned) BryanDavis: phabricator: June 2023 pride month logo variant [puppet] - https://gerrit.wikimedia.org/r/925970 (https://phabricator.wikimedia.org/T337964) (owner: BryanDavis)
[22:04:33] SRE, ops-knams, DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (wiki_willy)
[22:05:13] SRE, ops-knams, DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (wiki_willy)
[22:11:43] (PS4) EoghanGaffney: doc: Switch sync between nodes to rsync::quickdatacopy [puppet] - https://gerrit.wikimedia.org/r/925969 (https://phabricator.wikimedia.org/T333945)
[22:21:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[22:24:17] (CR) EoghanGaffney: doc: Switch sync between nodes to rsync::quickdatacopy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/925969 (https://phabricator.wikimedia.org/T333945) (owner: EoghanGaffney)
[22:59:23] (CR) Tim Starling: [C: +2] Remove runphpscriptletonallwikis.py [puppet] - https://gerrit.wikimedia.org/r/925295 (owner: Tim Starling)
[23:00:16] (PS1) Andrew Bogott: rabbitmq: Change node names to use the cname service name [puppet] - https://gerrit.wikimedia.org/r/926035 (https://phabricator.wikimedia.org/T336808)
[23:00:38] (CR) CI reject: [V: -1] rabbitmq: Change node names to use the cname service name [puppet] - https://gerrit.wikimedia.org/r/926035 (https://phabricator.wikimedia.org/T336808) (owner: Andrew Bogott)
[23:02:15] (PS1) Superpes15: [itwiktionary] Add a tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/926036 (https://phabricator.wikimedia.org/T337688)
[23:04:31] (PS2) Andrew Bogott: rabbitmq: Change node names to use the cname service name [puppet] - https://gerrit.wikimedia.org/r/926035 (https://phabricator.wikimedia.org/T336808)
[23:07:19] (PS4) Tim Starling: Fix some mwscript bugs and clean up the style [mediawiki-config] - https://gerrit.wikimedia.org/r/925654
[23:10:44] (CR) Tim Starling: "PS4:" [mediawiki-config] - https://gerrit.wikimedia.org/r/925654 (owner: Tim Starling)
[23:12:50] (CR) Andrew Bogott: [C: +2] rabbitmq: Change node names to use the cname service name [puppet] - https://gerrit.wikimedia.org/r/926035 (https://phabricator.wikimedia.org/T336808) (owner: Andrew Bogott)
[23:24:27] (PS1) Cwhite: prometheus: add external swagger checks to all sites [puppet] - https://gerrit.wikimedia.org/r/925119 (https://phabricator.wikimedia.org/T320620)
[23:28:57] (PS1) Cwhite: lvs: remove lvs::monitor_services [puppet] - https://gerrit.wikimedia.org/r/925120 (https://phabricator.wikimedia.org/T320620)
[23:57:34] SRE, ops-codfw, Cloud-VPS, Infrastructure-Foundations, and 2 others: cloudservices[2004/2005]-dev & cloudweb2002-dev: connect them to cloudsw so they can have cloud-private vlan - https://phabricator.wikimedia.org/T336587 (Andrew) I haven't dug much, but designate is currently failing on cloudser...