[00:00:05] RoanKattouw and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220201T0000).
[00:00:05] SCardenasM: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:01:47] Hello
[00:01:51] tgr: Yes I'll do your patch first
[00:01:54] Hi!
[00:01:55] SCardenasM: and then yours
[00:01:59] Thanks!
[00:02:18] tgr: Oh sorry your patch is a Puppet patch, never mind, I can't +2 those
[00:02:53] (PS2) Catrope: Lower The Wikipedia Library extension edit count [mediawiki-config] - https://gerrit.wikimedia.org/r/758033 (https://phabricator.wikimedia.org/T288070) (owner: Scardenasmolinar)
[00:02:57] (CR) Catrope: [C: +2] Lower The Wikipedia Library extension edit count [mediawiki-config] - https://gerrit.wikimedia.org/r/758033 (https://phabricator.wikimedia.org/T288070) (owner: Scardenasmolinar)
[00:03:38] (Merged) jenkins-bot: Lower The Wikipedia Library extension edit count [mediawiki-config] - https://gerrit.wikimedia.org/r/758033 (https://phabricator.wikimedia.org/T288070) (owner: Scardenasmolinar)
[00:04:55] for the time being, I'll try to apply it locally per https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/How_code_is_updated#Cherry-picking_a_patch_from_gerrit
[00:05:37] tgr: wow. so nobody noticed for 3+ months?
[00:05:42] SCardenasM: Your patch is ready for testing on mwdebug1002
[00:05:43] PROBLEM - Check systemd state on centrallog2002 is CRITICAL: CRITICAL - degraded: The following units failed: prune_old_srv_syslog_directories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:59] RoanKattouw: Testing now
[00:06:00] bd808: no, mediawiki11 was used until earlier today
[00:06:21] I assume mediawiki12 was taken out of the scap list in november for some reason
[00:06:37] RECOVERY - Disk space on centrallog1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=centrallog1001&var-datasource=eqiad+prometheus/ops
[00:06:39] tgr: you can clobber that bad setting from ops/puppet in horizon for the project, but that may end up causing even more confusion.
[00:06:45] and then, it seems mediawiki11 broke? https://phabricator.wikimedia.org/T300525#7664161
[00:07:07] PROBLEM - Check systemd state on centrallog1001 is CRITICAL: CRITICAL - degraded: The following units failed: prune_old_srv_syslog_directories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:08] which do you think is preferable, horizon or cherry-picking?
[00:07:13] PROBLEM - Check unit status of prune_old_srv_syslog_directories on centrallog1001 is CRITICAL: CRITICAL: Status of the systemd unit prune_old_srv_syslog_directories https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:07:32] tgr: probably the cherry-pick is a better approach.
[00:09:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:13] PROBLEM - Check unit status of prune_old_srv_syslog_directories on centrallog2002 is CRITICAL: CRITICAL: Status of the systemd unit prune_old_srv_syslog_directories https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:10:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:10:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:42] RoanKattouw: LGTM!
[00:11:04] (PS3) Juan90264: Enable Local upload on ptwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/758495
[00:11:23] (PS4) Juan90264: Enable Local upload on ptwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/758495 (https://phabricator.wikimedia.org/T300466)
[00:11:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:54] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:758033|Lower The Wikipedia Library extension edit count (T288070)]] (duration: 00m 50s)
[00:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:59] T288070: Deploy The Wikipedia Library Echo notification - https://phabricator.wikimedia.org/T288070
[00:13:43] SCardenasM: Deployed! Congrats on finally completing the TWL rollout plan :)
[00:14:11] I wrote some of the original code in the TWL extension, so I'm very happy to see it used in production (although I'm sure most of my original code has been rewritten and improved by now)
[00:14:15] Yay! Thanks, you have been great help :]
[00:14:44] Hello
[00:14:55] RECOVERY - Cassandra instance data free space on restbase2010 is OK: DISK OK - free space: /srv/cassandra/instance-data 11026 MB (31% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[00:15:16] RoanKattouw a lot of your code guided us in the right direction
[00:15:21] Added a change to the backport now
[00:18:24] !log [WDQS Deploy] Deploy complete. Successful test query placed on query.wikidata.org, there's no relevant criticals in Icinga, and Grafana looks good
[00:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:42] (CR) Catrope: [C: +2] Enable Local upload on ptwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/758495 (https://phabricator.wikimedia.org/T300466) (owner: Juan90264)
[00:18:53] Puppet, Beta-Cluster-Infrastructure, Infrastructure-Foundations, Beta-Cluster-reproducible, Patch-For-Review: Beta cluster MediaWiki code not updating - https://phabricator.wikimedia.org/T300591 (Tgr)
[00:19:26] (Merged) jenkins-bot: Enable Local upload on ptwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/758495 (https://phabricator.wikimedia.org/T300466) (owner: Juan90264)
[00:20:09] RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:20:27] Excellent merged!
[00:21:05] Juan_90264: Your patch is ready for testing on mwdebug1002
[00:21:07] Puppet, Beta-Cluster-Infrastructure, Infrastructure-Foundations, Beta-Cluster-reproducible, Patch-For-Review: Beta cluster MediaWiki code not updating - https://phabricator.wikimedia.org/T300591 (Tgr) That seemed to fix it for now. jenkins scap [[https://integration.wikimedia.org/ci/view/Beta...
[00:21:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:40] Okay
[00:22:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:22:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:28] RoanKattouw: I tested and approved!
[00:23:49] Juan_90264: Alright, deploying
[00:23:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:32] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:758495|Enable Local upload on ptwikinews (T300466)]] (duration: 00m 50s)
[00:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:37] T300466: enable local upload on ptwikinews - https://phabricator.wikimedia.org/T300466
[00:28:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:29:58] (CR) RLazarus: [C: +2] Beta: Replace mediawiki11 with mediawiki12 [puppet] - https://gerrit.wikimedia.org/r/758584 (https://phabricator.wikimedia.org/T300591) (owner: Gergő Tisza)
[00:30:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:30:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:40] Puppet, Beta-Cluster-Infrastructure, Infrastructure-Foundations, Beta-Cluster-reproducible, Patch-For-Review: Beta cluster MediaWiki code not updating - https://phabricator.wikimedia.org/T300591 (Tgr) @ema seems fixed but I'm not sure if this is the direction we want to go in (as opposed to f...
[00:43:59] RECOVERY - Cassandra instance data free space on restbase2012 is OK: DISK OK - free space: /srv/cassandra/instance-data 10987 MB (31% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[01:23:51] (CR) Ryan Kemper: [C: +2] rdf query service: Use constant filename for defaults [puppet] - https://gerrit.wikimedia.org/r/757124 (https://phabricator.wikimedia.org/T299222) (owner: Ebernhardson)
[01:24:59] !log T299222 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/757124; running puppet on `w*qs*` before purging old filepaths
[01:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:25:05] T299222: Properly configure logback for W[CD]QS streaming updater - https://phabricator.wikimedia.org/T299222
[01:42:35] !log T299222 `ryankemper@cumin1001:~$ sudo cumin 'wdqs*' 'sudo rm -fv /etc/default/wdqs-updater'`
[01:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:42:38] T299222: Properly configure logback for W[CD]QS streaming updater - https://phabricator.wikimedia.org/T299222
[01:42:41] !log T299222 `ryankemper@cumin1001:~$ sudo cumin 'wcqs*' 'sudo rm -fv /etc/default/wcqs-updater'`
[01:42:42] SRE, SRE-Access-Requests, Analytics, User-Ladsgroup: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (Ladsgroup) a:Ladsgroup According to https://github.com/wikimedia/puppet/blob/production/modules/admin/data/data.yaml your shell username is dannyh, or you...
[01:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:45:14] (CR) Ryan Kemper: [C: +2] Add cname for commons-query.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/717606 (https://phabricator.wikimedia.org/T282117) (owner: Ebernhardson)
[01:48:17] !log T282117 Merged https://gerrit.wikimedia.org/r/c/operations/dns/+/717606 and successfully ran `sudo -i authdns-update` on `authdns1001`. `commons-query.wikimedia.org` is online now. (sidenote: go-live date of service is 2022-02-01)
[01:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:48:20] T282117: WCQS needs to be exposed through a wikimedia.org domain - https://phabricator.wikimedia.org/T282117
[02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220201T0200)
[02:07:08] (PS1) TrainBranchBot: Branch commit for wmf/1.38.0-wmf.20 [core] (wmf/1.38.0-wmf.20) - https://gerrit.wikimedia.org/r/758593
[02:07:10] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.38.0-wmf.20 [core] (wmf/1.38.0-wmf.20) - https://gerrit.wikimedia.org/r/758593 (owner: TrainBranchBot)
[02:07:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[02:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:08:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[02:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:08:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[02:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:09:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[02:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:18:42] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet2004-dev.codfw.wmnet with OS bullseye
[02:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:21:39] (Merged) jenkins-bot: Branch commit for wmf/1.38.0-wmf.20 [core] (wmf/1.38.0-wmf.20) - https://gerrit.wikimedia.org/r/758593 (owner: TrainBranchBot)
[02:24:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[02:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:25:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[02:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:25:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[02:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:26:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[02:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:41:02] PROBLEM - Apache HTTP on wtp1029 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1311 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[02:42:40] PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:02:10] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:25:56] PROBLEM - Disk space on deneb is CRITICAL: DISK CRITICAL - free space: / 4682 MB (2% inode=71%): /tmp 4682 MB (2% inode=71%): /var/tmp 4682 MB (2% inode=71%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=deneb&var-datasource=codfw+prometheus/ops
[03:34:36] (PS1) Andrew Bogott: Neutron/Victoria/Bullseye: try removing iptables entirely [puppet] - https://gerrit.wikimedia.org/r/758595
[03:36:00] (CR) Andrew Bogott: [C: +2] Neutron/Victoria/Bullseye: try removing iptables entirely [puppet] - https://gerrit.wikimedia.org/r/758595 (owner: Andrew Bogott)
[03:36:50] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudnet2004-dev.codfw.wmnet with OS bullseye
[03:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:37:10] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet2004-dev.codfw.wmnet with OS bullseye
[03:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:43:20] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:02:11] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:08:26] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet2004-dev.codfw.wmnet with OS bullseye
[05:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:28:23] (PS4) Juan90264: Add 'wgUploadNavigationUrl' upload page of ptwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/758497 (https://phabricator.wikimedia.org/T300466)
[05:38:26] (PS1) Ladsgroup: admin: Add dannyh to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/758603 (https://phabricator.wikimedia.org/T300579)
[05:39:27] (CR) jerkins-bot: [V: -1] admin: Add dannyh to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/758603 (https://phabricator.wikimedia.org/T300579) (owner: Ladsgroup)
[05:40:26] SRE, SRE-Access-Requests, Analytics, Patch-For-Review, User-Ladsgroup: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (Ladsgroup) I made the patch for it, please confirm that the correct LDAP username is dannyh and I will merge it. Keep it in mind this is...
[05:44:29] (PS6) KartikMistry: Deploy Flores MT [deployment-charts] - https://gerrit.wikimedia.org/r/751547 (https://phabricator.wikimedia.org/T298584)
[05:45:45] (PS2) Ladsgroup: admin: Add dannyh to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/758603 (https://phabricator.wikimedia.org/T300579)
[05:46:49] (CR) jerkins-bot: [V: -1] admin: Add dannyh to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/758603 (https://phabricator.wikimedia.org/T300579) (owner: Ladsgroup)
[05:53:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[05:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:53:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[05:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:53:25] (PS3) Ladsgroup: admin: Add dannyh to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/758603 (https://phabricator.wikimedia.org/T300579)
[05:53:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T298558)', diff saved to https://phabricator.wikimedia.org/P19710 and previous config saved to /var/cache/conftool/dbconfig/20220201-055327-marostegui.json
[05:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:53:30] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558
[05:56:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 25%: repooling', diff saved to https://phabricator.wikimedia.org/P19711 and previous config saved to /var/cache/conftool/dbconfig/20220201-055638-root.json
[05:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:59:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[05:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:59:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[05:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:59:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T298558)', diff saved to https://phabricator.wikimedia.org/P19712 and previous config saved to /var/cache/conftool/dbconfig/20220201-055921-marostegui.json
[05:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:59:24] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558
[06:00:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298558)', diff saved to https://phabricator.wikimedia.org/P19713 and previous config saved to /var/cache/conftool/dbconfig/20220201-060035-marostegui.json
[06:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:34] (PS1) Marostegui: Revert "db1154: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/758498
[06:02:18] (CR) Marostegui: [C: +2] Revert "db1154: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/758498 (owner: Marostegui)
[06:07:09] (PS1) Marostegui: db-production.php: Disable writes on es4 [mediawiki-config] - https://gerrit.wikimedia.org/r/758715 (https://phabricator.wikimedia.org/T300127)
[06:07:34] (CR) Marostegui: [C: -2] "Wait for the failover date" [mediawiki-config] - https://gerrit.wikimedia.org/r/758715 (https://phabricator.wikimedia.org/T300127) (owner: Marostegui)
[06:10:25] Hello
[06:11:34] Hey :)
[06:11:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 50%: repooling', diff saved to https://phabricator.wikimedia.org/P19714 and previous config saved to /var/cache/conftool/dbconfig/20220201-061142-root.json
[06:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:10] Why is there no "UTC evening backport window" today?
[06:14:53] thcipriani: ?
[06:15:09] can't see anything on `ops-l` about it (other than changes happening in a couple of weeks..)
[06:15:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P19715 and previous config saved to /var/cache/conftool/dbconfig/20220201-061540-marostegui.json
[06:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:04] (PS1) Marostegui: mariadb: Promote es1020 to es4 master [puppet] - https://gerrit.wikimedia.org/r/758716 (https://phabricator.wikimedia.org/T300127)
[06:16:19] (CR) Marostegui: [C: -2] "Wait for the failover date" [puppet] - https://gerrit.wikimedia.org/r/758716 (https://phabricator.wikimedia.org/T300127) (owner: Marostegui)
[06:16:43] (CR) Ladsgroup: [C: +1] db-production.php: Disable writes on es4 [mediawiki-config] - https://gerrit.wikimedia.org/r/758715 (https://phabricator.wikimedia.org/T300127) (owner: Marostegui)
[06:17:20] (PS1) Marostegui: wmnet: Promote es1020 to es4 master [dns] - https://gerrit.wikimedia.org/r/758717 (https://phabricator.wikimedia.org/T300127)
[06:17:58] (CR) Marostegui: [C: -2] "Wait for the failover date" [dns] - https://gerrit.wikimedia.org/r/758717 (https://phabricator.wikimedia.org/T300127) (owner: Marostegui)
[06:20:54] (PS1) Marostegui: db1110: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/758719 (https://phabricator.wikimedia.org/T300473)
[06:21:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1110 for reimage T300473', diff saved to https://phabricator.wikimedia.org/P19716 and previous config saved to /var/cache/conftool/dbconfig/20220201-062111-marostegui.json
[06:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:14] T300473: Upgrade s5 to Bullseye - https://phabricator.wikimedia.org/T300473
[06:21:58] (CR) Marostegui: [C: +2] db1110: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/758719 (https://phabricator.wikimedia.org/T300473) (owner: Marostegui)
[06:24:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1110.eqiad.wmnet with OS bullseye
[06:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 75%: repooling', diff saved to https://phabricator.wikimedia.org/P19717 and previous config saved to /var/cache/conftool/dbconfig/20220201-062646-root.json
[06:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance
[06:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance
[06:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance
[06:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance
[06:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[06:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[06:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[06:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[06:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T300402)', diff saved to https://phabricator.wikimedia.org/P19718 and previous config saved to /var/cache/conftool/dbconfig/20220201-063013-marostegui.json
[06:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:16] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402
[06:30:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P19719 and previous config saved to /var/cache/conftool/dbconfig/20220201-063044-marostegui.json
[06:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T300402)', diff saved to https://phabricator.wikimedia.org/P19720 and previous config saved to /var/cache/conftool/dbconfig/20220201-063126-marostegui.json
[06:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:42] tn: finding this strange.... Just when I would choose to deploy, https://wikitech.wikimedia.org/wiki/Deployments#Tuesday,_February_1
[06:34:11] *I'm finding
[06:35:12] Juan_90264: what are you finding strange?
[06:36:53] @marostegui: Why does "Pre MediaWiki train break" only appear? Couldn't put it together with the evening backport
[06:37:18] *Couldn't you put it together with the evening backport?
[06:37:43] I say today
[06:41:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 100%: repooling', diff saved to https://phabricator.wikimedia.org/P19721 and previous config saved to /var/cache/conftool/dbconfig/20220201-064149-root.json
[06:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298558)', diff saved to https://phabricator.wikimedia.org/P19722 and previous config saved to /var/cache/conftool/dbconfig/20220201-064549-marostegui.json
[06:45:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[06:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:52] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558
[06:45:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[06:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[06:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[06:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:08] (PS1) Marostegui: Revert "db1110: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/758499
[06:46:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[06:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[06:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T298558)', diff saved to https://phabricator.wikimedia.org/P19723 and previous config saved to /var/cache/conftool/dbconfig/20220201-064620-marostegui.json
[06:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P19724 and previous config saved to /var/cache/conftool/dbconfig/20220201-064631-marostegui.json
[06:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298558)', diff saved to https://phabricator.wikimedia.org/P19725 and previous config saved to /var/cache/conftool/dbconfig/20220201-064734-marostegui.json
[06:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:59] (CR) Marostegui: [C: +2] Revert "db1110: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/758499 (owner: Marostegui)
[06:49:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 5%: repooling', diff saved to https://phabricator.wikimedia.org/P19726 and previous config saved to /var/cache/conftool/dbconfig/20220201-064930-root.json
[06:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:40] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host db1110.eqiad.wmnet with OS bullseye
[06:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:14] SRE, ops-codfw, DC-Ops, Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (elukey) >>! In T294946#7665517, @Papaul wrote: > Can someone please update this task with the Partitioning/Raid information? > > Thanks. Hi Papaul! Th...
[07:01:33] (CR) Elukey: profile::kafka::broker: add pki_intermediate_name parameter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/757800 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[07:01:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P19727 and previous config saved to /var/cache/conftool/dbconfig/20220201-070135-marostegui.json
[07:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P19728 and previous config saved to /var/cache/conftool/dbconfig/20220201-070239-marostegui.json
[07:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 10%: repooling', diff saved to https://phabricator.wikimedia.org/P19729 and previous config saved to /var/cache/conftool/dbconfig/20220201-070434-root.json
[07:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:40] (PS1) Elukey: install_server: add partman recipe for ml-staging nodes [puppet] - https://gerrit.wikimedia.org/r/758725 (https://phabricator.wikimedia.org/T294946)
[07:16:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T300402)', diff saved to https://phabricator.wikimedia.org/P19730 and previous config saved to /var/cache/conftool/dbconfig/20220201-071640-marostegui.json
[07:16:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[07:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[07:16:44] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402
[07:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T300402)', diff saved to https://phabricator.wikimedia.org/P19731 and previous config saved to /var/cache/conftool/dbconfig/20220201-071648-marostegui.json
[07:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P19732 and previous config saved to /var/cache/conftool/dbconfig/20220201-071743-marostegui.json
[07:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T300402)', diff saved to https://phabricator.wikimedia.org/P19733 and previous config saved to /var/cache/conftool/dbconfig/20220201-071801-marostegui.json
[07:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 25%: repooling', diff saved to https://phabricator.wikimedia.org/P19734 and previous config saved to /var/cache/conftool/dbconfig/20220201-071938-root.json
[07:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:23] Puppet, Beta-Cluster-Infrastructure, Infrastructure-Foundations, Beta-Cluster-reproducible: Beta cluster MediaWiki code not updating - https://phabricator.wikimedia.org/T300591 (AlexisJazz) Beta cluster is down again. (see T300525#7666400)
[07:24:21] (CR) Elukey: [C: +2] install_server: add partman recipe for ml-staging nodes [puppet] - https://gerrit.wikimedia.org/r/758725 (https://phabricator.wikimedia.org/T294946) (owner: Elukey)
[07:32:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298558)', diff saved to https://phabricator.wikimedia.org/P19735 and previous config saved to /var/cache/conftool/dbconfig/20220201-073248-marostegui.json
[07:32:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[07:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[07:32:52] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558
[07:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T298558)', diff saved to https://phabricator.wikimedia.org/P19736 and previous config saved to
/var/cache/conftool/dbconfig/20220201-073256-marostegui.json [07:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P19737 and previous config saved to /var/cache/conftool/dbconfig/20220201-073306-marostegui.json [07:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 50%: repooling', diff saved to https://phabricator.wikimedia.org/P19738 and previous config saved to /var/cache/conftool/dbconfig/20220201-073441-root.json [07:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:32] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:36:41] (03CR) 10Filippo Giunchedi: [C: 03+1] "Good starting point! 
LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/757489 (https://phabricator.wikimedia.org/T294564) (owner: 10Volans) [07:39:17] !log filippo@puppetmaster1001 conftool action : set/weight=10; selector: name=prometheus1005.eqiad.wmnet [07:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:36] (03PS1) 10Filippo Giunchedi: hieradata: swap prometheus1003 with prometheus1005 [puppet] - 10https://gerrit.wikimedia.org/r/758776 (https://phabricator.wikimedia.org/T296199) [07:42:17] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: swap prometheus1003 with prometheus1005 [puppet] - 10https://gerrit.wikimedia.org/r/758776 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [07:43:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298558)', diff saved to https://phabricator.wikimedia.org/P19739 and previous config saved to /var/cache/conftool/dbconfig/20220201-074311-marostegui.json [07:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:15] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [07:47:03] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=prometheus1005.eqiad.wmnet [07:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P19740 and previous config saved to /var/cache/conftool/dbconfig/20220201-074810-marostegui.json [07:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 75%: repooling', diff saved to https://phabricator.wikimedia.org/P19741 and previous config saved to /var/cache/conftool/dbconfig/20220201-074945-root.json [07:49:46] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:21] (03PS1) 10Filippo Giunchedi: conftool-data: add eqiad prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/758778 [07:50:53] (03PS1) 10Elukey: ores::web: log XFF header as REMOTE_ADDR when available [puppet] - 10https://gerrit.wikimedia.org/r/758779 (https://phabricator.wikimedia.org/T299137) [07:51:55] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33526/console" [puppet] - 10https://gerrit.wikimedia.org/r/758779 (https://phabricator.wikimedia.org/T299137) (owner: 10Elukey) [07:53:53] 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 10serviceops, and 2 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (10Joe) To further clarify how I conduct the no-timeout tests: - `sudo puppet agent --disable "T292322 --$USER"` `/etc/envoy/envoy.yaml`: - Change the timeou... [07:55:45] (03CR) 10Filippo Giunchedi: [C: 03+2] conftool-data: add eqiad prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/758778 (owner: 10Filippo Giunchedi) [07:56:24] !log filippo@puppetmaster1001 conftool action : set/weight=10; selector: name=prometheus1005.eqiad.wmnet [07:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:31] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=prometheus1005.eqiad.wmnet [07:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P19742 and previous config saved to /var/cache/conftool/dbconfig/20220201-075816-marostegui.json [07:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:47] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=prometheus1003.eqiad.wmnet [08:01:47] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:15] (03PS1) 10Muehlenhoff: Remove access for bumeh-ctr [puppet] - 10https://gerrit.wikimedia.org/r/758780 [08:03:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T300402)', diff saved to https://phabricator.wikimedia.org/P19743 and previous config saved to /var/cache/conftool/dbconfig/20220201-080315-marostegui.json [08:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [08:03:19] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [08:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [08:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T300402)', diff saved to https://phabricator.wikimedia.org/P19744 and previous config saved to /var/cache/conftool/dbconfig/20220201-080328-marostegui.json [08:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:25] (03PS1) 10Marostegui: s3 codfw db*: Disable 
notifications [puppet] - 10https://gerrit.wikimedia.org/r/758781 (https://phabricator.wikimedia.org/T300600) [08:04:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T300402)', diff saved to https://phabricator.wikimedia.org/P19745 and previous config saved to /var/cache/conftool/dbconfig/20220201-080442-marostegui.json [08:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 100%: repooling', diff saved to https://phabricator.wikimedia.org/P19746 and previous config saved to /var/cache/conftool/dbconfig/20220201-080449-root.json [08:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:43] (03CR) 10Marostegui: [C: 03+2] s3 codfw db*: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/758781 (https://phabricator.wikimedia.org/T300600) (owner: 10Marostegui) [08:05:46] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:06:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2074.codfw.wmnet with OS bullseye [08:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2109.codfw.wmnet with OS bullseye [08:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:25] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for bumeh-ctr [puppet] - 10https://gerrit.wikimedia.org/r/758780 (owner: 10Muehlenhoff) [08:09:14] PROBLEM - Check no envoy runtime configuration is left persistent on mwdebug1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 396 bytes in 0.002 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:10:32] (03PS1) 10Marostegui: db1100: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/758782 (https://phabricator.wikimedia.org/T300473) [08:10:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1100 for reimage T300473', diff saved to https://phabricator.wikimedia.org/P19747 and previous config saved to /var/cache/conftool/dbconfig/20220201-081050-marostegui.json [08:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:54] T300473: Upgrade s5 to Bullseye - https://phabricator.wikimedia.org/T300473 [08:12:35] (03CR) 10Marostegui: [C: 03+2] db1100: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/758782 (https://phabricator.wikimedia.org/T300473) (owner: 10Marostegui) [08:13:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P19748 and previous config saved to /var/cache/conftool/dbconfig/20220201-081321-marostegui.json [08:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1100.eqiad.wmnet with OS bullseye [08:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:13] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33527/console" [puppet] - 10https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261) (owner: 10Filippo Giunchedi) [08:19:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P19749 and previous config saved to /var/cache/conftool/dbconfig/20220201-081947-marostegui.json [08:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:10] (03CR) 
10JMeybohm: [C: 03+1] imagecatalog: Only run on the active deployment host [puppet] - 10https://gerrit.wikimedia.org/r/757530 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus) [08:23:41] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti1008.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [08:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti1008.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [08:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:12] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) >>! In T299527#7664606, @Cmjohnson wrote: > 1010 is updated, 1019 is locking up, I will need to power off and unplug Ack, thank... 
[08:26:29] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) [08:27:54] (03CR) 10JMeybohm: [C: 03+1] mediawiki::logging::yaml_defs: use wmf-certificates' bundle as CA cert [puppet] - 10https://gerrit.wikimedia.org/r/757661 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [08:28:02] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1100.eqiad.wmnet with OS bullseye [08:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298558)', diff saved to https://phabricator.wikimedia.org/P19750 and previous config saved to /var/cache/conftool/dbconfig/20220201-082825-marostegui.json [08:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:28] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [08:28:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [08:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [08:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance [08:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance [08:28:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [08:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [08:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T298558)', diff saved to https://phabricator.wikimedia.org/P19751 and previous config saved to /var/cache/conftool/dbconfig/20220201-082906-marostegui.json [08:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298558)', diff saved to https://phabricator.wikimedia.org/P19752 and previous config saved to /var/cache/conftool/dbconfig/20220201-083020-marostegui.json [08:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:22] 10SRE, 10Python3-Porting: git-fat needs to be ported to Python 3 - https://phabricator.wikimedia.org/T279509 (10MoritzMuehlenhoff) >>! In T279509#7663452, @Ladsgroup wrote: >>>! In T279509#6979904, @MoritzMuehlenhoff wrote: >> git-fat is the only package requiring Python 2 in a base bullseye setup at this poin... 
[08:33:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1100.eqiad.wmnet with OS bullseye [08:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P19753 and previous config saved to /var/cache/conftool/dbconfig/20220201-083452-marostegui.json [08:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:25] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:36:18] (03PS1) 10JMeybohm: Add affinity to SSD nodes to termbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/758785 [08:38:50] !log draining ganeti1016 for eventual reimage [08:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2109.codfw.wmnet with OS bullseye [08:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2074.codfw.wmnet with OS bullseye [08:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2149.codfw.wmnet with OS bullseye [08:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P19754 and previous config saved to /var/cache/conftool/dbconfig/20220201-084524-marostegui.json [08:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:32] !log marostegui@cumin1001 START - 
Cookbook sre.hosts.reimage for host db2127.codfw.wmnet with OS bullseye [08:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T300402)', diff saved to https://phabricator.wikimedia.org/P19755 and previous config saved to /var/cache/conftool/dbconfig/20220201-084956-marostegui.json [08:49:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [08:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [08:50:00] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [08:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [08:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [08:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T300402)', diff saved to https://phabricator.wikimedia.org/P19756 and previous config saved to /var/cache/conftool/dbconfig/20220201-085040-marostegui.json [08:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T300402)', diff saved to https://phabricator.wikimedia.org/P19757 and 
previous config saved to /var/cache/conftool/dbconfig/20220201-085155-marostegui.json [08:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:15] (03PS1) 10Marostegui: Revert "db1100: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/758501 [08:55:20] (03CR) 10Gehel: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/757700 (owner: 10Ryan Kemper) [08:57:33] (03PS2) 10Vgutierrez: site: Reimage cp3062 as cache::text_envoy [puppet] - 10https://gerrit.wikimedia.org/r/758512 (https://phabricator.wikimedia.org/T271421) [08:57:43] (03PS2) 10Jcrespo: backup: remove fileset for static-bugzilla [puppet] - 10https://gerrit.wikimedia.org/r/757509 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [08:58:04] (03PS1) 10Joal: Make hdfs user able to run yarn applications [puppet] - 10https://gerrit.wikimedia.org/r/758787 (https://phabricator.wikimedia.org/T300611) [08:58:44] 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 10serviceops, and 2 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (10Joe) So apparently my small fixes to PagedTiffHandler shaved off about 100 seconds off of the shellbox-based request, see: https://performance.wikimedia.... 
[08:59:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1100.eqiad.wmnet with OS bullseye [08:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:23] (03CR) 10Marostegui: [C: 03+2] Revert "db1100: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/758501 (owner: 10Marostegui) [09:00:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P19758 and previous config saved to /var/cache/conftool/dbconfig/20220201-090029-marostegui.json [09:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 5%: repooling', diff saved to https://phabricator.wikimedia.org/P19759 and previous config saved to /var/cache/conftool/dbconfig/20220201-090031-root.json [09:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:31] !log depool cp3062 to be reimaged as cache::text_envoy - T271421 [09:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:33] T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 [09:02:11] !log apt1001 Delete unused stretch and buster dist libvarnisapi1 package T300264 [09:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:14] (03CR) 10Jcrespo: [C: 03+2] backup: remove fileset for static-bugzilla [puppet] - 10https://gerrit.wikimedia.org/r/757509 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [09:02:56] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp3062 as cache::text_envoy [puppet] - 10https://gerrit.wikimedia.org/r/758512 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [09:03:55] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp3062.esams.wmnet 
with OS buster [09:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:04] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp3062.esams.wmnet with OS buster [09:07:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P19760 and previous config saved to /var/cache/conftool/dbconfig/20220201-090700-marostegui.json [09:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:10] RECOVERY - cassandra-a service on restbase1027 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:15:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298558)', diff saved to https://phabricator.wikimedia.org/P19761 and previous config saved to /var/cache/conftool/dbconfig/20220201-091534-marostegui.json [09:15:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [09:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [09:15:37] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [09:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 10%: repooling', diff saved to https://phabricator.wikimedia.org/P19762 and previous config saved 
to /var/cache/conftool/dbconfig/20220201-091541-root.json [09:15:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T298558)', diff saved to https://phabricator.wikimedia.org/P19763 and previous config saved to /var/cache/conftool/dbconfig/20220201-091541-marostegui.json [09:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:28] PROBLEM - cassandra-a service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:16:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2149.codfw.wmnet with OS bullseye [09:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:40] (03PS16) 10ArielGlenn: snapshot: replace the word cron everywhere [puppet] - 10https://gerrit.wikimedia.org/r/736074 (owner: 10Dzahn) [09:20:46] !log installing apache/apache-modsecurity2 security updates [09:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:34] (03PS1) 10Marostegui: Revert "s3 codfw db*: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/758503 [09:21:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2127.codfw.wmnet with OS bullseye [09:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P19764 and previous config saved to /var/cache/conftool/dbconfig/20220201-092204-marostegui.json [09:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:07] (03Abandoned) 10Joal: Make hdfs user able to run yarn applications [puppet] - 10https://gerrit.wikimedia.org/r/758787 
(https://phabricator.wikimedia.org/T300611) (owner: 10Joal) [09:29:55] (03CR) 10Gehel: [C: 03+1] "LGTM. PCC seems to agree: https://puppet-compiler.wmflabs.org/pcc-worker1003/33529/" [puppet] - 10https://gerrit.wikimedia.org/r/757700 (owner: 10Ryan Kemper) [09:30:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 25%: repooling', diff saved to https://phabricator.wikimedia.org/P19765 and previous config saved to /var/cache/conftool/dbconfig/20220201-093044-root.json [09:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:32] (03PS7) 10JMeybohm: Add hostname-override and cluster-cidr to kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/758466 (https://phabricator.wikimedia.org/T300500) [09:33:18] (03CR) 10DCausse: [C: 03+1] rdf-query-service: consistently suffix env vars [puppet] - 10https://gerrit.wikimedia.org/r/757996 (owner: 10Ebernhardson) [09:35:03] (03CR) 10ArielGlenn: [C: 03+2] "I verified that the change is good on sql/xml dumps snapshot hosts as well as on the snapshot server that runs the "misc" dumps. 
Thanks fo" [puppet] - 10https://gerrit.wikimedia.org/r/736074 (owner: 10Dzahn) [09:36:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298558)', diff saved to https://phabricator.wikimedia.org/P19766 and previous config saved to /var/cache/conftool/dbconfig/20220201-093653-marostegui.json [09:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:57] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [09:37:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T300402)', diff saved to https://phabricator.wikimedia.org/P19767 and previous config saved to /var/cache/conftool/dbconfig/20220201-093709-marostegui.json [09:37:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [09:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:12] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [09:37:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [09:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T300402)', diff saved to https://phabricator.wikimedia.org/P19768 and previous config saved to /var/cache/conftool/dbconfig/20220201-093717-marostegui.json [09:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T300402)', diff saved to https://phabricator.wikimedia.org/P19769 and previous config saved to 
/var/cache/conftool/dbconfig/20220201-093747-marostegui.json [09:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:54] (03CR) 10ArielGlenn: "One more note (hey, why isn't there a task for this patch?), let's convert the actual cron jobs one at a time, so we can schedule each pat" [puppet] - 10https://gerrit.wikimedia.org/r/736074 (owner: 10Dzahn) [09:38:38] ACKNOWLEDGEMENT - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_network_flows_internal-sanitization_daily.service,refinery-sqoop-whole-mediawiki.service Btullis Investigating this sqoop failure. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:45:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 50%: repooling', diff saved to https://phabricator.wikimedia.org/P19770 and previous config saved to /var/cache/conftool/dbconfig/20220201-094548-root.json [09:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:10] * kart_ updating cxserver in few minutes.. 
[09:49:38] (03PS1) 10Jelto: hiera/cloud/gitlab-test: add floating ip [puppet] - 10https://gerrit.wikimedia.org/r/758793 (https://phabricator.wikimedia.org/T297411) [09:50:21] RECOVERY - Disk space on deneb is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=deneb&var-datasource=codfw+prometheus/ops [09:51:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P19771 and previous config saved to /var/cache/conftool/dbconfig/20220201-095158-marostegui.json [09:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:12] (03CR) 10Marostegui: [C: 03+2] Revert "s3 codfw db*: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/758503 (owner: 10Marostegui) [09:52:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P19772 and previous config saved to /var/cache/conftool/dbconfig/20220201-095251-marostegui.json [09:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:34] (03CR) 10KartikMistry: [C: 03+2] "Deployment." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/751547 (https://phabricator.wikimedia.org/T298584) (owner: 10KartikMistry) [09:58:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1010.eqiad.wmnet with OS buster [09:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:06] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1010.eqiad.wmnet with OS buster [09:59:13] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01012 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [09:59:39] (03Merged) 10jenkins-bot: Deploy Flores MT [deployment-charts] - 10https://gerrit.wikimedia.org/r/751547 (https://phabricator.wikimedia.org/T298584) (owner: 10KartikMistry) [10:00:42] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply on staging [10:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:45] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply on production [10:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 75%: repooling', diff saved to https://phabricator.wikimedia.org/P19773 and previous config saved to /var/cache/conftool/dbconfig/20220201-100052-root.json [10:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:16] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: sync on staging [10:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:47] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cp3062.esams.wmnet with OS buster [10:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:57] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp3062.esams.wmnet with OS buster completed: - cp3062 (**WARN*... [10:01:58] !log ayounsi@cumin1001 START - Cookbook sre.ganeti.makevm for new host netflow6001.drmrs.wmnet [10:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:11] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/758785 (owner: 10JMeybohm) [10:05:10] (03CR) 10JMeybohm: [C: 03+2] Add affinity to SSD nodes to termbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/758785 (owner: 10JMeybohm) [10:05:22] (03CR) 10Elukey: [V: 03+1] "I added the option manually for ores1001, and I can see in logstash the first external IPs coming up, so it seems working. 
Adding Filippo " [puppet] - 10https://gerrit.wikimedia.org/r/758779 (https://phabricator.wikimedia.org/T299137) (owner: 10Elukey) [10:05:45] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply on production [10:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:48] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply on staging [10:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P19774 and previous config saved to /var/cache/conftool/dbconfig/20220201-100703-marostegui.json [10:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P19775 and previous config saved to /var/cache/conftool/dbconfig/20220201-100756-marostegui.json [10:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:02] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: sync on production [10:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:59] (03Merged) 10jenkins-bot: Add affinity to SSD nodes to termbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/758785 (owner: 10JMeybohm) [10:09:29] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01012 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:10:25] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply on production [10:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:27] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: 
apply on staging [10:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:22] (03CR) 10Giuseppe Lavagetto: "First the good part: I think the change makes a lot of sense, and the changes to the data make sense. I left a couple nitpicks." [puppet] - 10https://gerrit.wikimedia.org/r/757447 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:14:19] !log pool cp3062 running envoy as TLS terminator - T271421 [10:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:22] T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 [10:16:01] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01012 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:16:38] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10Vgutierrez) [10:17:20] (03PS1) 10Ayounsi: Add netflow6001 to DHCP [puppet] - 10https://gerrit.wikimedia.org/r/758795 [10:18:09] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01069 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:20:42] (03PS1) 10Marostegui: db2105: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/758797 (https://phabricator.wikimedia.org/T300600) [10:20:54] (03CR) 10Elukey: "LGTM, thanks a lot for the investigation work, really nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/758466 (https://phabricator.wikimedia.org/T300500) (owner: 10JMeybohm) [10:21:50] (03CR) 10Marostegui: [C: 03+2] db2105: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/758797 (https://phabricator.wikimedia.org/T300600) (owner: 10Marostegui) [10:22:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298558)', diff saved to https://phabricator.wikimedia.org/P19777 and previous config saved to /var/cache/conftool/dbconfig/20220201-102207-marostegui.json [10:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [10:22:11] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [10:22:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [10:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T298558)', diff saved to https://phabricator.wikimedia.org/P19778 and previous config saved to /var/cache/conftool/dbconfig/20220201-102221-marostegui.json 
[10:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T300402)', diff saved to https://phabricator.wikimedia.org/P19779 and previous config saved to /var/cache/conftool/dbconfig/20220201-102300-marostegui.json [10:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:04] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [10:23:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298558)', diff saved to https://phabricator.wikimedia.org/P19780 and previous config saved to /var/cache/conftool/dbconfig/20220201-102333-marostegui.json [10:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [10:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [10:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:54] (03PS5) 10Filippo Giunchedi: service catalog: introduce 'page' field [puppet] - 10https://gerrit.wikimedia.org/r/757447 (https://phabricator.wikimedia.org/T291946) [10:23:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T300402)', diff saved to https://phabricator.wikimedia.org/P19781 and previous config saved to /var/cache/conftool/dbconfig/20220201-102356-marostegui.json [10:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:16] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Bumeh-ctr out of all services on: 5 hosts [10:24:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Bumeh-ctr out of all services on: 5 hosts [10:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2105.codfw.wmnet with OS bullseye [10:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1010.eqiad.wmnet with OS buster [10:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:05] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1010.eqiad.wmnet with OS buster completed: - ganeti1010 (**PASS**)... 
[10:25:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove contributions from s4 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P19782 and previous config saved to /var/cache/conftool/dbconfig/20220201-102512-marostegui.json [10:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:15] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 [10:25:51] (03CR) 10Filippo Giunchedi: [V: 03+1] "Thank you for" [puppet] - 10https://gerrit.wikimedia.org/r/757447 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:26:36] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005621 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:28:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T300402)', diff saved to https://phabricator.wikimedia.org/P19783 and previous config saved to /var/cache/conftool/dbconfig/20220201-102857-marostegui.json [10:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:00] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [10:33:13] (03CR) 10Ayounsi: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/33532/" [puppet] - 10https://gerrit.wikimedia.org/r/758795 (owner: 10Ayounsi) [10:33:38] RECOVERY - cassandra-c service on restbase1027 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:34:10] RECOVERY - cassandra-b service on restbase1027 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:34:18] RECOVERY - Check systemd state on restbase1027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:34:48] RECOVERY - cassandra-b CQL 10.64.48.185:9042 on 
restbase1027 is OK: TCP OK - 0.000 second response time on 10.64.48.185 port 9042 https://phabricator.wikimedia.org/T93886 [10:35:08] RECOVERY - cassandra-b SSL 10.64.48.185:7001 on restbase1027 is OK: SSL OK - Certificate restbase1027-b valid until 2023-04-14 11:21:35 +0000 (expires in 437 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [10:35:10] RECOVERY - cassandra-c SSL 10.64.48.186:7001 on restbase1027 is OK: SSL OK - Certificate restbase1027-c valid until 2023-04-14 11:21:38 +0000 (expires in 437 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [10:35:12] RECOVERY - cassandra-c CQL 10.64.48.186:9042 on restbase1027 is OK: TCP OK - 0.000 second response time on 10.64.48.186 port 9042 https://phabricator.wikimedia.org/T93886 [10:37:25] (03PS8) 10JMeybohm: Add hostname-override and cluster-cidr to kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/758466 (https://phabricator.wikimedia.org/T300500) [10:38:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P19784 and previous config saved to /var/cache/conftool/dbconfig/20220201-103838-marostegui.json [10:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove all special groups from s4 codfw T263127', diff saved to https://phabricator.wikimedia.org/P19785 and previous config saved to /var/cache/conftool/dbconfig/20220201-104118-marostegui.json [10:41:19] !log restart ATS-TLS on cp3058 [10:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:22] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 [10:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', 
diff saved to https://phabricator.wikimedia.org/P19786 and previous config saved to /var/cache/conftool/dbconfig/20220201-104402-marostegui.json [10:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:47] RECOVERY - cassandra-a CQL 10.64.48.184:9042 on restbase1027 is OK: TCP OK - 0.000 second response time on 10.64.48.184 port 9042 https://phabricator.wikimedia.org/T93886 [10:47:33] jouncebot: nowandnext [10:47:34] No deployments scheduled for the next 1 hour(s) and 12 minute(s) [10:47:34] In 1 hour(s) and 12 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220201T1200) [10:49:49] alright, I’ll quickly (hopefully ^^) deploy something [10:51:15] RECOVERY - cassandra-a SSL 10.64.48.184:7001 on restbase1027 is OK: SSL OK - Certificate restbase1027-a valid until 2023-04-14 11:21:33 +0000 (expires in 437 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [10:52:55] (03CR) 10JMeybohm: Add hostname-override and cluster-cidr to kube-proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758466 (https://phabricator.wikimedia.org/T300500) (owner: 10JMeybohm) [10:53:03] (03CR) 10JMeybohm: [C: 03+2] Add hostname-override and cluster-cidr to kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/758466 (https://phabricator.wikimedia.org/T300500) (owner: 10JMeybohm) [10:53:05] (03CR) 10Giuseppe Lavagetto: [V: 03+1] thanos::frontend: fix envoy configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/757452 (https://phabricator.wikimedia.org/T300119) (owner: 10Giuseppe Lavagetto) [10:53:09] RECOVERY - cassandra-a service on restbase1027 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:53:13] !log Deployed patch for T297754 [10:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:43] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P19787 and previous config saved to /var/cache/conftool/dbconfig/20220201-105343-marostegui.json [10:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:56] 10SRE, 10Traffic: Remove old and unused libvarnishapi - https://phabricator.wikimedia.org/T300247 (10ema) We should probably add a `Conflicts: libvarnishapi1` to our varnish 6 packaging, or whatever relationship magic is the right one to ensure that if varnish 6 is installed, libvarnishapi1 is not. [10:55:21] 10SRE, 10Traffic: Remove old and unused libvarnishapi - https://phabricator.wikimedia.org/T300247 (10MoritzMuehlenhoff) >>! In T300247#7667111, @ema wrote: > We should probably add a `Conflicts: libvarnishapi1` to our varnish 6 packaging, or whatever relationship magic is the right one to ensure that if varnis... [10:55:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2105.codfw.wmnet with OS bullseye [10:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [10:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:27] (I’m done) [10:58:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [10:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [10:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:21] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 66 probes of 653 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:58:33] (03PS1) 10Marostegui: Revert "db2105: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/758813 [10:59:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [10:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P19788 and previous config saved to /var/cache/conftool/dbconfig/20220201-105906-marostegui.json [10:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:58] (03PS1) 10Arturo Borrero Gonzalez: toolforge: exec-manage: fix 'list' option [puppet] - 10https://gerrit.wikimedia.org/r/758803 [11:03:11] PROBLEM - Host ganeti1013.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:04:41] RECOVERY - Host ganeti1013.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms [11:08:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298558)', diff saved to https://phabricator.wikimedia.org/P19789 and previous config saved to /var/cache/conftool/dbconfig/20220201-110848-marostegui.json [11:08:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [11:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [11:08:51] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [11:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T298558)', diff saved to https://phabricator.wikimedia.org/P19790 and previous config saved to /var/cache/conftool/dbconfig/20220201-110855-marostegui.json [11:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:05] (03PS2) 10Hnowlan: restbase: remove restbase2011 [puppet] - 10https://gerrit.wikimedia.org/r/757648 (https://phabricator.wikimedia.org/T299928) [11:10:56] (03PS2) 10Hnowlan: maps: tweak postgres configuration settings to use more resources [puppet] - 10https://gerrit.wikimedia.org/r/757424 (https://phabricator.wikimedia.org/T298246) [11:12:52] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33534/console" [puppet] - 10https://gerrit.wikimedia.org/r/757424 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [11:13:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: exec-manage: fix 'list' option [puppet] - 10https://gerrit.wikimedia.org/r/758803 (owner: 10Arturo Borrero Gonzalez) [11:14:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T298558)', diff saved to https://phabricator.wikimedia.org/P19791 and previous config saved to /var/cache/conftool/dbconfig/20220201-111409-marostegui.json [11:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:12] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [11:14:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T300402)', diff saved to https://phabricator.wikimedia.org/P19792 and previous config saved to /var/cache/conftool/dbconfig/20220201-111413-marostegui.json [11:14:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 
db1110.eqiad.wmnet with reason: Maintenance [11:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:16] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [11:14:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [11:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T300402)', diff saved to https://phabricator.wikimedia.org/P19793 and previous config saved to /var/cache/conftool/dbconfig/20220201-111420-marostegui.json [11:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:49] !log roll-restarting maps services in codfw for updates [11:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:25] (03CR) 10Jgiannelos: [C: 03+1] maps: tweak postgres configuration settings to use more resources [puppet] - 10https://gerrit.wikimedia.org/r/757424 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [11:19:18] !log roll-restarting maps services in eqiad for updates [11:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:11] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] maps: tweak postgres configuration settings to use more resources [puppet] - 10https://gerrit.wikimedia.org/r/757424 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [11:23:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T300402)', diff saved to https://phabricator.wikimedia.org/P19794 and previous config saved to /var/cache/conftool/dbconfig/20220201-112325-marostegui.json [11:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:28] T300402: Add 
namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [11:23:52] (03CR) 10Filippo Giunchedi: [C: 03+1] "Yep +1!" [puppet] - 10https://gerrit.wikimedia.org/r/758779 (https://phabricator.wikimedia.org/T299137) (owner: 10Elukey) [11:24:31] (03CR) 10Elukey: [V: 03+1 C: 03+2] ores::web: log XFF header as REMOTE_ADDR when available [puppet] - 10https://gerrit.wikimedia.org/r/758779 (https://phabricator.wikimedia.org/T299137) (owner: 10Elukey) [11:26:49] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 60 probes of 653 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:27:08] (03CR) 10Jbond: [C: 03+1] management: remove deprecated module [software/spicerack] - 10https://gerrit.wikimedia.org/r/757747 (owner: 10Volans) [11:29:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P19795 and previous config saved to /var/cache/conftool/dbconfig/20220201-112913-marostegui.json [11:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:01] (03CR) 10Volans: [C: 03+2] management: remove deprecated module [software/spicerack] - 10https://gerrit.wikimedia.org/r/757747 (owner: 10Volans) [11:31:03] !log roll restart ORES to pick up logging change (use XFF header when possible) - T299137 [11:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:06] T299137: Improve ORES observability - https://phabricator.wikimedia.org/T299137 [11:32:18] https://www.irccloud.com/pastebin/2stnYxpR/ [11:33:07] Anyone know what's wrong here? 
https://integration.wikimedia.org/ci/job/service-pipeline-test/11087/console [11:34:38] (03CR) 10Jbond: [C: 03+2] "LGTM thx" [puppet] - 10https://gerrit.wikimedia.org/r/758093 (owner: 10Majavah) [11:35:22] (03CR) 10Jbond: [C: 03+2] "thanks this was on my list <3 :)" [puppet] - 10https://gerrit.wikimedia.org/r/758095 (owner: 10Majavah) [11:36:42] (03CR) 10Jbond: [C: 03+1] "lgtm" [homer/public] - 10https://gerrit.wikimedia.org/r/748098 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [11:37:16] (03CR) 10Jbond: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/748111 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [11:37:51] (03CR) 10Jbond: [C: 03+1] Delete now unused analytics policy file [homer/public] - 10https://gerrit.wikimedia.org/r/758470 (owner: 10Ayounsi) [11:38:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P19796 and previous config saved to /var/cache/conftool/dbconfig/20220201-113830-marostegui.json [11:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:23] PROBLEM - Check systemd state on maps1008 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:51] (03Merged) 10jenkins-bot: management: remove deprecated module [software/spicerack] - 10https://gerrit.wikimedia.org/r/757747 (owner: 10Volans) [11:44:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P19797 and previous config saved to /var/cache/conftool/dbconfig/20220201-114418-marostegui.json [11:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:27] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:53] PROBLEM - Check systemd state on maps1010 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:50:49] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:53:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P19798 and previous config saved to /var/cache/conftool/dbconfig/20220201-115334-marostegui.json [11:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:08] (03PS1) 10Hashar: Revert "blubberoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/758805 (https://phabricator.wikimedia.org/T296046) [11:56:32] (03CR) 10Jbond: [C: 03+1] "LGTM, see optional nit" [software/spicerack] - 10https://gerrit.wikimedia.org/r/758558 (owner: 10Volans) [11:57:28] (03CR) 10Hashar: "Looks like the Blubber patch https://gerrit.wikimedia.org/r/749569 broke the build for mediawiki/services/cxserver so I am getting blubbe" [deployment-charts] - 10https://gerrit.wikimedia.org/r/758805 (https://phabricator.wikimedia.org/T296046) (owner: 10Hashar) [11:57:46] (Device rebooted) firing: Alert for device ps1-a8-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org [11:57:51] PROBLEM - Check systemd state on maps1005 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:59:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T298558)', diff saved to https://phabricator.wikimedia.org/P19799 and previous config saved to /var/cache/conftool/dbconfig/20220201-115923-marostegui.json 
[11:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:27] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220201T1200). [12:00:04] dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:33] (I’m in a meeting and can’t deploy) [12:00:47] (03CR) 10Alexandros Kosiaris: [C: 03+1] Revert "blubberoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/758805 (https://phabricator.wikimedia.org/T296046) (owner: 10Hashar) [12:01:04] i can deploy, or dcausse can self-serve? [12:01:09] o/ [12:01:19] urbanecm: sure I can deploy [12:01:24] go ahead then :) [12:01:39] ok doing :) [12:01:44] dcausse: I see an interesting code review comment, I assume that the dependency service was restarted already? [12:01:59] awight: yes [12:02:02] thanks! [12:02:40] (03PS1) 10Giuseppe Lavagetto: Add key stub for thanos-fe-combined.discovery.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/758826 [12:02:46] (Device rebooted) resolved: Device ps1-a8-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org [12:05:12] awight: actually I don't see the deploy on deployment-charts yesterday I just checked schema.wikimedia.org [12:06:45] dcausse: I was only mentioning due to "watchdog" instincts, sorry to hear there might be an actual problem! lmk if I can help in any way. [12:07:05] awight: thanks for making me double check! 
:) [12:07:36] * awight munches popcorn unhelpfully ;-) [12:07:49] :) [12:08:18] going to deploy a schema bump to eventgate-main instead [12:08:37] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:08:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T300402)', diff saved to https://phabricator.wikimedia.org/P19800 and previous config saved to /var/cache/conftool/dbconfig/20220201-120839-marostegui.json [12:08:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [12:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [12:08:43] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [12:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T300402)', diff saved to https://phabricator.wikimedia.org/P19801 and previous config saved to /var/cache/conftool/dbconfig/20220201-120847-marostegui.json [12:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:39] (03PS2) 10DCausse: rdf-streaming-updater: add the reconciliation stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753788 (https://phabricator.wikimedia.org/T279541) [12:10:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T300402)', diff saved to https://phabricator.wikimedia.org/P19802 and previous config saved to /var/cache/conftool/dbconfig/20220201-121051-marostegui.json [12:10:53] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:39] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Revert "blubberoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/758805 (https://phabricator.wikimedia.org/T296046) (owner: 10Hashar) [12:21:59] <_joe_> hashar: reverting as soon as it's merged [12:22:13] _joe_: awesome thank you! [12:22:23] Nikerabbit: kart_: the Blubber faulty image is being rolledback :) [12:23:12] (03PS1) 10Hnowlan: maps: increase postgres resources across cluster [puppet] - 10https://gerrit.wikimedia.org/r/758829 (https://phabricator.wikimedia.org/T298246) [12:23:38] (03Merged) 10jenkins-bot: Revert "blubberoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/758805 (https://phabricator.wikimedia.org/T296046) (owner: 10Hashar) [12:25:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P19803 and previous config saved to /var/cache/conftool/dbconfig/20220201-122556-marostegui.json [12:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:18] (03PS1) 10Arturo Borrero Gonzalez: toolsws: utils: format with black [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/758831 [12:28:20] (03PS1) 10Arturo Borrero Gonzalez: toolsws: add uwsgi-python3 webservice type [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/758832 (https://phabricator.wikimedia.org/T300501) [12:28:21] _joe_: hashar Thanks! [12:28:35] kart_: please recheck the cxserver build ? 
;) [12:28:44] <_joe_> kart_: wait a few minutes [12:28:56] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply on staging [12:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:59] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply on production [12:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:13] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: sync on staging [12:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:19] <_joe_> I'm rolling it out now [12:29:59] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/blubberoid: apply on production [12:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:01] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/blubberoid: apply on staging [12:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:20] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/blubberoid: sync on production [12:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:46] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/blubberoid: apply on production [12:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:49] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/blubberoid: apply on staging [12:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:23] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/blubberoid: sync on production [12:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:42] <_joe_> jelto: i think we're logging too much now, heh [12:32:04] <_joe_> I think I chose the wrong hook for DONE [12:32:46] <_joe_> kart_: 
blubberoid is rolled back [12:34:15] (03CR) 10Jgiannelos: [C: 03+1] maps: increase postgres resources across cluster [puppet] - 10https://gerrit.wikimedia.org/r/758829 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [12:34:24] _joe_: cool [12:35:09] backups are getting a bit delayed due to 1st of month full runs, I am monitoring the situation [12:39:11] !log installing openjdk-11 security updates [12:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [12:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [12:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [12:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [12:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [12:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [12:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1147.eqiad.wmnet with reason: 
Maintenance [12:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [12:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T298558)', diff saved to https://phabricator.wikimedia.org/P19804 and previous config saved to /var/cache/conftool/dbconfig/20220201-124004-marostegui.json [12:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:07] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [12:41:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P19805 and previous config saved to /var/cache/conftool/dbconfig/20220201-124100-marostegui.json [12:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T298558)', diff saved to https://phabricator.wikimedia.org/P19806 and previous config saved to /var/cache/conftool/dbconfig/20220201-124110-marostegui.json [12:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:06] !log mmandere@cumin1001 START - Cookbook sre.ganeti.makevm for new host durum6001.drmrs.wmnet [12:42:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:58] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Add key stub for thanos-fe-combined.discovery.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/758826 (owner: 10Giuseppe Lavagetto) [12:49:31] (03PS3) 10Giuseppe Lavagetto: thanos::frontend: fix envoy configuration [puppet] - 10https://gerrit.wikimedia.org/r/757452 
(https://phabricator.wikimedia.org/T300119) [12:50:59] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33535/console" [puppet] - 10https://gerrit.wikimedia.org/r/757452 (https://phabricator.wikimedia.org/T300119) (owner: 10Giuseppe Lavagetto) [12:52:27] (03CR) 10Jbond: profile::kafka::broker: add pki_intermediate_name parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/757800 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [12:52:53] !log mmandere@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum6001.drmrs.wmnet [12:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T300402)', diff saved to https://phabricator.wikimedia.org/P19807 and previous config saved to /var/cache/conftool/dbconfig/20220201-125605-marostegui.json [12:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:09] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [12:56:09] !log Set innodb_adaptive_hash_index=OFF on: db1129 es1029 es1030 es1028 es1020 es1023 T268869 [12:56:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [12:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:12] T268869: Consider setting innodb_adaptive_hash_index=OFF by default - https://phabricator.wikimedia.org/T268869 [12:56:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [12:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: 
Maintenance [12:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P19808 and previous config saved to /var/cache/conftool/dbconfig/20220201-125615-marostegui.json [12:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [12:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:11] Thanks, _joe_ and hasharm for helping us. Quite unlucky deployment day for cxserver with multiple issues :) [12:57:54] 10SRE, 10Traffic, 10Patch-For-Review: Create Ganeti VMs for durum in drmrs - https://phabricator.wikimedia.org/T300158 (10MMandere) [12:57:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [12:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [12:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T300402)', diff saved to https://phabricator.wikimedia.org/P19809 and previous config saved to /var/cache/conftool/dbconfig/20220201-125805-marostegui.json [12:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T300402)', diff saved to https://phabricator.wikimedia.org/P19810 and previous config saved to 
/var/cache/conftool/dbconfig/20220201-130010-marostegui.json [13:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:24] !log Restarted Jenkins on releases1002.eqiad.wmnet [13:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:53] (03PS1) 10MMandere: install_server: Add drmrs durum first instance [puppet] - 10https://gerrit.wikimedia.org/r/758836 (https://phabricator.wikimedia.org/T300158) [13:06:49] (03CR) 10Majavah: [C: 03+2] toolsws: utils: format with black [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/758831 (owner: 10Arturo Borrero Gonzalez) [13:08:09] (03Merged) 10jenkins-bot: toolsws: utils: format with black [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/758831 (owner: 10Arturo Borrero Gonzalez) [13:09:24] !log Restarting Gerrit [13:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:28] !log Restarting CI Jenkins [13:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:18] (03CR) 10Majavah: [C: 04-1] "You need to update the various type definitions in toolsws/backends/kubernetes.py and in toolsws/backends/gridengine.py, otherwise the new" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/758832 (https://phabricator.wikimedia.org/T300501) (owner: 10Arturo Borrero Gonzalez) [13:11:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P19812 and previous config saved to /var/cache/conftool/dbconfig/20220201-131119-marostegui.json [13:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:17] (03CR) 10MMandere: "```" [puppet] - 10https://gerrit.wikimedia.org/r/758836 (https://phabricator.wikimedia.org/T300158) (owner: 10MMandere) [13:13:29] (03CR) 10MMandere: [C: 03+2] install_server: Add drmrs durum first instance [puppet] - 
10https://gerrit.wikimedia.org/r/758836 (https://phabricator.wikimedia.org/T300158) (owner: 10MMandere) [13:15:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P19813 and previous config saved to /var/cache/conftool/dbconfig/20220201-131515-marostegui.json [13:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:06] (03PS4) 10MMandere: site: add role for durum hosts in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/757741 (https://phabricator.wikimedia.org/T300158) (owner: 10Ssingh) [13:16:21] (03CR) 10MMandere: [V: 03+2 C: 03+2] site: add role for durum hosts in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/757741 (https://phabricator.wikimedia.org/T300158) (owner: 10Ssingh) [13:19:20] (03Abandoned) 10Majavah: exec-manage: update example host and fix list command [puppet] - 10https://gerrit.wikimedia.org/r/757496 (owner: 10Majavah) [13:26:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T298558)', diff saved to https://phabricator.wikimedia.org/P19814 and previous config saved to /var/cache/conftool/dbconfig/20220201-132624-marostegui.json [13:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:28] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [13:26:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance [13:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance [13:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 
on 12 hosts with reason: Maintenance [13:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 12 hosts with reason: Maintenance [13:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [13:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [13:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T298558)', diff saved to https://phabricator.wikimedia.org/P19815 and previous config saved to /var/cache/conftool/dbconfig/20220201-132652-marostegui.json [13:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T298558)', diff saved to https://phabricator.wikimedia.org/P19816 and previous config saved to /var/cache/conftool/dbconfig/20220201-132858-marostegui.json [13:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P19817 and previous config saved to /var/cache/conftool/dbconfig/20220201-133020-marostegui.json [13:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:51] (03CR) 10Arturo Borrero Gonzalez: "I'm sorry I forgot about this patch :-(" [puppet] - 10https://gerrit.wikimedia.org/r/757496 (owner: 10Majavah) [13:32:47] !log btullis@cumin1001 START - 
Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons. [13:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:35] (03PS1) 10Kosta Harlan: linkrecommendation: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/758850 [13:37:02] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/758850 (owner: 10Kosta Harlan) [13:38:55] !log btullis@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons. [13:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:19] (03PS1) 10Arturo Borrero Gonzalez: toolforge: automated tests: use uwsgi-plain webservice to test the generic web grid [puppet] - 10https://gerrit.wikimedia.org/r/758851 (https://phabricator.wikimedia.org/T300501) [13:39:57] (03CR) 10jerkins-bot: [V: 04-1] toolforge: automated tests: use uwsgi-plain webservice to test the generic web grid [puppet] - 10https://gerrit.wikimedia.org/r/758851 (https://phabricator.wikimedia.org/T300501) (owner: 10Arturo Borrero Gonzalez) [13:40:20] (03PS2) 10Arturo Borrero Gonzalez: toolforge: automated tests: use uwsgi-plain webservice to test the generic web grid [puppet] - 10https://gerrit.wikimedia.org/r/758851 (https://phabricator.wikimedia.org/T300501) [13:40:46] (03Merged) 10jenkins-bot: linkrecommendation: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/758850 (owner: 10Kosta Harlan) [13:40:56] (03CR) 10jerkins-bot: [V: 04-1] toolforge: automated tests: use uwsgi-plain webservice to test the generic web grid [puppet] - 10https://gerrit.wikimedia.org/r/758851 (https://phabricator.wikimedia.org/T300501) (owner: 10Arturo Borrero Gonzalez) [13:41:17] !log kharlan@deploy1002 helmfile [staging] START 
helmfile.d/services/linkrecommendation: apply on staging [13:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:19] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on internal [13:41:20] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on external [13:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:37] !log btullis@cumin1001 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons. [13:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:11] (03PS3) 10Arturo Borrero Gonzalez: toolforge: automated tests: use uwsgi-plain ws to test the generic web grid [puppet] - 10https://gerrit.wikimedia.org/r/758851 (https://phabricator.wikimedia.org/T300501) [13:43:12] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply on staging [13:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:15] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on internal [13:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:16] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on external [13:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:48] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: sync on staging [13:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P19818 and previous 
config saved to /var/cache/conftool/dbconfig/20220201-134403-marostegui.json [13:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T300402)', diff saved to https://phabricator.wikimedia.org/P19819 and previous config saved to /var/cache/conftool/dbconfig/20220201-134524-marostegui.json [13:45:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [13:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:28] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [13:45:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [13:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [13:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [13:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:10] !log kharlan@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply on internal [13:47:10] !log kharlan@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply on external [13:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:12] !log kharlan@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/linkrecommendation: apply on staging [13:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [13:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [13:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:34] (03CR) 10Jbond: [C: 03+2] hieradata: pcc: add cloudinfra-internal-puppetmaster02 key [puppet] - 10https://gerrit.wikimedia.org/r/758092 (owner: 10Majavah) [13:47:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T300402)', diff saved to https://phabricator.wikimedia.org/P19820 and previous config saved to /var/cache/conftool/dbconfig/20220201-134740-marostegui.json [13:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:45] !log btullis@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons. 
[13:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: automated tests: use uwsgi-plain ws to test the generic web grid [puppet] - 10https://gerrit.wikimedia.org/r/758851 (https://phabricator.wikimedia.org/T300501) (owner: 10Arturo Borrero Gonzalez) [13:48:23] !log kharlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: sync on external [13:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:39] !log btullis@cumin1001 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons. [13:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:42] !log kharlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: sync on internal [13:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T300402)', diff saved to https://phabricator.wikimedia.org/P19821 and previous config saved to /var/cache/conftool/dbconfig/20220201-134942-marostegui.json [13:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:04] (03PS12) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) [13:50:48] !log kharlan@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply on external [13:50:48] !log kharlan@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply on internal [13:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:52] !log kharlan@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/linkrecommendation: apply on staging [13:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:06] !log kharlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: sync on external [13:52:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:39] (03CR) 10jerkins-bot: [V: 04-1] O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [13:52:45] (03CR) 10Kosta Harlan: "Ic9e6ae76aaf8c92b53eb47519493dfa472c0147b was needed for this change to take effect." [deployment-charts] - 10https://gerrit.wikimedia.org/r/752608 (https://phabricator.wikimedia.org/T298857) (owner: 10Gergő Tisza) [13:54:46] !log btullis@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons. [13:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:58] !log kharlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: sync on internal [13:54:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P19822 and previous config saved to /var/cache/conftool/dbconfig/20220201-135908-marostegui.json [13:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P19823 and previous config saved to /var/cache/conftool/dbconfig/20220201-140447-marostegui.json [14:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:48] (03PS2) 10Volans: dhcp: case-insensitive match if Dell serial 
number [software/spicerack] - 10https://gerrit.wikimedia.org/r/758558 [14:05:50] (03CR) 10Volans: "addressed comments" [software/spicerack] - 10https://gerrit.wikimedia.org/r/758558 (owner: 10Volans) [14:09:53] (03PS13) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) [14:10:43] (03CR) 10jerkins-bot: [V: 04-1] O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [14:14:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T298558)', diff saved to https://phabricator.wikimedia.org/P19824 and previous config saved to /var/cache/conftool/dbconfig/20220201-141413-marostegui.json [14:14:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [14:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [14:14:17] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [14:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T298558)', diff saved to https://phabricator.wikimedia.org/P19825 and previous config saved to /var/cache/conftool/dbconfig/20220201-141420-marostegui.json [14:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T298558)', diff saved to https://phabricator.wikimedia.org/P19826 and 
previous config saved to /var/cache/conftool/dbconfig/20220201-141527-marostegui.json [14:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P19827 and previous config saved to /var/cache/conftool/dbconfig/20220201-141952-marostegui.json [14:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ml-serve2005.mgmt.codfw.wmnet with reboot policy FORCED [14:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:02] (03CR) 10jerkins-bot: [V: 04-1] dhcp: case-insensitive match if Dell serial number [software/spicerack] - 10https://gerrit.wikimedia.org/r/758558 (owner: 10Volans) [14:28:35] (03PS1) 10Arturo Borrero Gonzalez: toolforge: automated-tests: cosmetic fixes [puppet] - 10https://gerrit.wikimedia.org/r/758859 [14:29:18] (03PS1) 10Filippo Giunchedi: sre: check agent resources too in PuppetFailure [alerts] - 10https://gerrit.wikimedia.org/r/758860 (https://phabricator.wikimedia.org/T299628) [14:29:42] 10SRE, 10Observability-Alerting, 10Patch-For-Review, 10User-fgiunchedi: Debug / fine tune puppet failed metrics and alerts on alert* hosts - https://phabricator.wikimedia.org/T299628 (10fgiunchedi) >>! In T299628#7635655, @Majavah wrote: > I've noticed that when puppet fails to compile catalog, it won't sh... 
[14:30:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P19828 and previous config saved to /var/cache/conftool/dbconfig/20220201-143031-marostegui.json [14:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:34] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve2005.mgmt.codfw.wmnet with reboot policy FORCED [14:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:50] (03PS2) 10Hnowlan: maps: increase postgres resources across eqiad [puppet] - 10https://gerrit.wikimedia.org/r/758829 (https://phabricator.wikimedia.org/T298246) [14:34:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T300402)', diff saved to https://phabricator.wikimedia.org/P19829 and previous config saved to /var/cache/conftool/dbconfig/20220201-143456-marostegui.json [14:34:58] (03PS1) 10KartikMistry: Update cxserver to 2022-02-01-141918-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/758862 (https://phabricator.wikimedia.org/T298592) [14:34:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [14:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [14:35:00] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [14:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T300402)', diff saved to https://phabricator.wikimedia.org/P19830 and previous config 
saved to /var/cache/conftool/dbconfig/20220201-143504-marostegui.json [14:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T300402)', diff saved to https://phabricator.wikimedia.org/P19831 and previous config saved to /var/cache/conftool/dbconfig/20220201-143809-marostegui.json [14:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:42] 10SRE, 10ops-ulsfo, 10Traffic: SMART errors on cp4031 - https://phabricator.wikimedia.org/T300493 (10Vgutierrez) p:05Triage→03Medium [14:40:22] (03PS1) 10Muehlenhoff: Failover idp to idp1001 [dns] - 10https://gerrit.wikimedia.org/r/758863 [14:42:14] 10SRE, 10ops-ulsfo, 10Traffic: SMART errors on cp4031 - https://phabricator.wikimedia.org/T300493 (10Vgutierrez) @RobH just pinging you to make you aware of this one for your next visit to the datacenter [14:44:51] (03CR) 10Muehlenhoff: [C: 03+2] Failover idp to idp1001 [dns] - 10https://gerrit.wikimedia.org/r/758863 (owner: 10Muehlenhoff) [14:44:56] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2022-02-01-141918-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/758862 (https://phabricator.wikimedia.org/T298592) (owner: 10KartikMistry) [14:45:21] Deploying cxserver fix soon. [14:45:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P19832 and previous config saved to /var/cache/conftool/dbconfig/20220201-144536-marostegui.json [14:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:00] 10SRE, 10ops-ulsfo, 10Traffic: SMART errors on cp4031 - https://phabricator.wikimedia.org/T300493 (10RobH) a:03RobH I'll put in a self dispatch for a new SSD and go swap when it shows up. Will update later today. 
[14:48:43] (03Merged) 10jenkins-bot: Update cxserver to 2022-02-01-141918-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/758862 (https://phabricator.wikimedia.org/T298592) (owner: 10KartikMistry) [14:50:06] PROBLEM - Check systemd state on durum6001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:52:20] !log mmandere@cumin1001 START - Cookbook sre.ganeti.makevm for new host durum6002.drmrs.wmnet [14:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:45] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply on staging [14:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:47] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply on production [14:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P19833 and previous config saved to /var/cache/conftool/dbconfig/20220201-145314-marostegui.json [14:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:22] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: sync on staging [14:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:32] PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:55:58] RECOVERY - Check systemd state on centrallog1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:56:00] RECOVERY - Check systemd state on centrallog2002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:56:02] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply on production [14:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:05] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply on staging [14:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:12] RECOVERY - Check unit status of prune_old_srv_syslog_directories on centrallog2002 is OK: OK: Status of the systemd unit prune_old_srv_syslog_directories https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:58:28] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: sync on production [14:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:33] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply on production [14:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:35] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply on staging [14:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T298558)', diff saved to https://phabricator.wikimedia.org/P19834 and previous config saved to /var/cache/conftool/dbconfig/20220201-150041-marostegui.json [15:00:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [15:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:44] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [15:00:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet 
with reason: Maintenance [15:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T298558)', diff saved to https://phabricator.wikimedia.org/P19835 and previous config saved to /var/cache/conftool/dbconfig/20220201-150049-marostegui.json [15:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:32] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: sync on production [15:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T298558)', diff saved to https://phabricator.wikimedia.org/P19836 and previous config saved to /var/cache/conftool/dbconfig/20220201-150155-marostegui.json [15:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:11] (03CR) 10Filippo Giunchedi: [C: 03+1] "Can't say I fully understand why this works, but let's go!" 
[puppet] - 10https://gerrit.wikimedia.org/r/757452 (https://phabricator.wikimedia.org/T300119) (owner: 10Giuseppe Lavagetto) [15:03:06] RECOVERY - Check unit status of prune_old_srv_syslog_directories on centrallog1001 is OK: OK: Status of the systemd unit prune_old_srv_syslog_directories https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:05:14] !log mmandere@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum6002.drmrs.wmnet [15:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:54] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti1016.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [15:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti1016.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [15:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P19837 and previous config saved to /var/cache/conftool/dbconfig/20220201-150818-marostegui.json [15:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:54] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) One more server is ready and downtimed; ganeti1016 [15:09:07] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) [15:09:46] (03CR) 10Klausman: [C: 03+1] "LGTM, but where do the numbers come from, just basic first estimates?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/757675 (https://phabricator.wikimedia.org/T294414) (owner: 10Elukey) [15:10:08] !log update scap to 4.2.2 on all hosts - T300392 [15:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:11] T300392: Deploy Scap version 4.2.2 - https://phabricator.wikimedia.org/T300392 [15:11:11] Ah forgot to log 2 deployments! [15:11:38] (03PS1) 10JMeybohm: prometheus-exporters/statsd: Run with --log.level=warn [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/758868 (https://phabricator.wikimedia.org/T300629) [15:11:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1010.eqiad.wmnet [15:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:34] !log Deployed Flores MT for cxserver + Updated cxserver to 2022-01-13-174407-production (T298584, T292412, T292415, T298679, T298752) + Updated cxserver to 2022-02-01-141918-production (T298592) [15:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:44] T298584: Deploy Flores MT in Production - https://phabricator.wikimedia.org/T298584 [15:13:44] T298752: cxserver gives empty result for translation of template without about attribute - https://phabricator.wikimedia.org/T298752 [15:13:44] T298592: Machine translation fails for paragraphs with reference with images - https://phabricator.wikimedia.org/T298592 [15:13:44] T298679: Wrong section segmentation causes infobox presented as table - https://phabricator.wikimedia.org/T298679 [15:13:45] T292415: Create Wikipedia Paiwan - https://phabricator.wikimedia.org/T292415 [15:13:45] T292412: Technical exploration for the integration of Flores MT models - https://phabricator.wikimedia.org/T292412 [15:15:24] (03PS1) 10MMandere: install_server: Add drmrs durum second instance [puppet] - 10https://gerrit.wikimedia.org/r/758869 (https://phabricator.wikimedia.org/T300158) [15:16:10] (03CR) 
10Elukey: helmfile.d: add circuit breaking settings for ml-serve's egress (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/757675 (https://phabricator.wikimedia.org/T294414) (owner: 10Elukey) [15:17:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P19838 and previous config saved to /var/cache/conftool/dbconfig/20220201-151700-marostegui.json [15:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1010.eqiad.wmnet [15:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:06] (03CR) 10Ssingh: [C: 03+1] install_server: Add drmrs durum second instance [puppet] - 10https://gerrit.wikimedia.org/r/758869 (https://phabricator.wikimedia.org/T300158) (owner: 10MMandere) [15:20:53] (03CR) 10MMandere: [C: 03+2] install_server: Add drmrs durum second instance [puppet] - 10https://gerrit.wikimedia.org/r/758869 (https://phabricator.wikimedia.org/T300158) (owner: 10MMandere) [15:21:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1010.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [15:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:37] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [15:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:46] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 09s) [15:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T300402)', diff saved to https://phabricator.wikimedia.org/P19839 and previous config saved to 
/var/cache/conftool/dbconfig/20220201-152323-marostegui.json [15:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:26] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [15:24:31] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ml-serve2006.mgmt.codfw.wmnet with reboot policy FORCED [15:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [15:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [15:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [15:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [15:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P19840 and previous config saved to /var/cache/conftool/dbconfig/20220201-153204-marostegui.json [15:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve2006.mgmt.codfw.wmnet with reboot policy FORCED [15:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:04] !log ebysans@deploy1002 Started 
deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [15:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:12] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 08s) [15:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1010.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [15:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:36] (03CR) 10Hnowlan: [C: 03+2] maps: increase postgres resources across eqiad [puppet] - 10https://gerrit.wikimedia.org/r/758829 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [15:39:59] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10MoritzMuehlenhoff) [15:47:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T298558)', diff saved to https://phabricator.wikimedia.org/P19841 and previous config saved to /var/cache/conftool/dbconfig/20220201-154709-marostegui.json [15:47:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [15:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [15:47:13] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [15:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling 
db1143 (T298558)', diff saved to https://phabricator.wikimedia.org/P19842 and previous config saved to /var/cache/conftool/dbconfig/20220201-154716-marostegui.json [15:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298558)', diff saved to https://phabricator.wikimedia.org/P19843 and previous config saved to /var/cache/conftool/dbconfig/20220201-155023-marostegui.json [15:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:26] (03CR) 10JMeybohm: [C: 03+2] _ingress_helpers: HTTPRoute does not require a destination [deployment-charts] - 10https://gerrit.wikimedia.org/r/757934 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [15:52:34] (03CR) 10JMeybohm: [C: 03+2] Allow deploy users to create ingress and certificate objects [deployment-charts] - 10https://gerrit.wikimedia.org/r/757898 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [15:55:32] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ml-serve2007.mgmt.codfw.wmnet with reboot policy FORCED [15:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:00] 10SRE-OnFire: 2021-10-25 s3 db recentchanges replica - https://phabricator.wikimedia.org/T295154 (10LSobanski) a:03LSobanski [15:56:12] 10SRE-OnFire (FY2021/2022-Q2): 2021-10-25 s3 db recentchanges replica - https://phabricator.wikimedia.org/T295154 (10LSobanski) [15:56:14] (03Merged) 10jenkins-bot: Allow deploy users to create ingress and certificate objects [deployment-charts] - 10https://gerrit.wikimedia.org/r/757898 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [15:56:17] (03Merged) 10jenkins-bot: _ingress_helpers: HTTPRoute does not require a destination [deployment-charts] - 10https://gerrit.wikimedia.org/r/757934 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [15:58:46] 
(03CR) 10CDanis: [C: 03+1] sre: check agent resources too in PuppetFailure [alerts] - 10https://gerrit.wikimedia.org/r/758860 (https://phabricator.wikimedia.org/T299628) (owner: 10Filippo Giunchedi) [16:00:47] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missi [16:00:47] [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [16:04:21] (03PS1) 10Vgutierrez: site: Reimage cp2039 as cache::text_envoy [puppet] - 10https://gerrit.wikimedia.org/r/758879 (https://phabricator.wikimedia.org/T271421) [16:04:51] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [16:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:02] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 10s) [16:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P19844 and previous config saved to /var/cache/conftool/dbconfig/20220201-160528-marostegui.json [16:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:22] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: automated-tests: cosmetic fixes [puppet] - 10https://gerrit.wikimedia.org/r/758859 (owner: 10Arturo Borrero Gonzalez) [16:09:30] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [16:09:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:34] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 03s) [16:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:43] !log depool cp2039 to be reimaged as cache::text_envoy - T271421 [16:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:45] T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 [16:11:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [16:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [16:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [16:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [16:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve2007.mgmt.codfw.wmnet with reboot policy FORCED [16:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:34] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp2039 as cache::text_envoy [puppet] - 10https://gerrit.wikimedia.org/r/758879 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [16:11:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 
6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:51] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp2039.codfw.wmnet with OS buster [16:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:00] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp2039.codfw.wmnet with OS buster [16:13:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [16:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [16:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T300402)', diff saved to https://phabricator.wikimedia.org/P19845 and previous config saved to /var/cache/conftool/dbconfig/20220201-161353-marostegui.json [16:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:57] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [16:14:31] (03PS1) 10JMeybohm: Create certificates for different FQDN's in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/758880 (https://phabricator.wikimedia.org/T290966) [16:14:33] (03PS1) 10JMeybohm: Deploy 
ingress components to staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/758881 (https://phabricator.wikimedia.org/T290966) [16:15:07] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [16:17:04] (03CR) 10jerkins-bot: [V: 04-1] Create certificates for different FQDN's in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/758880 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [16:17:13] (03CR) 10jerkins-bot: [V: 04-1] Deploy ingress components to staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/758881 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [16:19:57] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 105 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [16:20:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P19846 and previous config saved to /var/cache/conftool/dbconfig/20220201-162033-marostegui.json [16:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:23] (03PS2) 10JMeybohm: Create certificates for different FQDN's in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/758880 (https://phabricator.wikimedia.org/T290966) [16:22:25] (03PS2) 10JMeybohm: Deploy ingress components to staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/758881 (https://phabricator.wikimedia.org/T290966) [16:24:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 1%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P19847 and previous config saved to 
/var/cache/conftool/dbconfig/20220201-162454-root.json [16:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:29] (03CR) 10jerkins-bot: [V: 04-1] Create certificates for different FQDN's in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/758880 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [16:26:02] what the... [16:26:13] (03CR) 10jerkins-bot: [V: 04-1] Deploy ingress components to staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/758881 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [16:27:07] 10SRE, 10Commons, 10Data-Persistence (Consultation), 10MediaWiki-extensions-WikibaseClient, and 4 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730 (10Lucas_Werkmeister_WMDE) …and this has to block a move from incorrect tracking to correct tracking? I’m n... [16:30:56] (03PS1) 10Elukey: helmfild.d: add STORAGE_URI setting to the ML draftquality's transformer [deployment-charts] - 10https://gerrit.wikimedia.org/r/758883 [16:31:59] ....oh fuck you google you signed me out after i made a dozen changes and didnt keep them to a doc [16:32:11] (03CR) 10Accraze: [C: 03+1] helmfild.d: add STORAGE_URI setting to the ML draftquality's transformer [deployment-charts] - 10https://gerrit.wikimedia.org/r/758883 (owner: 10Elukey) [16:32:14] * robh meant to grumble that in private oh well [16:32:16] Bogus! [16:32:43] (03CR) 10Dzahn: "ooh.. I was about to click Edit to add the bug link. But then I realized it's already merged :) wohoo. 
thanks" [puppet] - 10https://gerrit.wikimedia.org/r/736074 (owner: 10Dzahn) [16:33:41] (03PS1) 10JHathaway: Michael Holloway: remove mail alias and icinga alert [puppet] - 10https://gerrit.wikimedia.org/r/758884 [16:33:52] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 (10Dzahn) removed a bunch of "cron" strings from snapshot: https://gerrit.wikimedia.org/r/c/operations/puppet/+/736074 [16:34:52] (03CR) 10jerkins-bot: [V: 04-1] Michael Holloway: remove mail alias and icinga alert [puppet] - 10https://gerrit.wikimedia.org/r/758884 (owner: 10JHathaway) [16:35:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298558)', diff saved to https://phabricator.wikimedia.org/P19848 and previous config saved to /var/cache/conftool/dbconfig/20220201-163537-marostegui.json [16:35:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [16:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [16:35:42] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [16:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T298558)', diff saved to https://phabricator.wikimedia.org/P19849 and previous config saved to /var/cache/conftool/dbconfig/20220201-163545-marostegui.json [16:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:13] (03PS2) 10JHathaway: Michael Holloway: remove mail 
alias and icinga alert [puppet] - 10https://gerrit.wikimedia.org/r/758884 [16:36:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T298558)', diff saved to https://phabricator.wikimedia.org/P19850 and previous config saved to /var/cache/conftool/dbconfig/20220201-163651-marostegui.json [16:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:24] (03CR) 10JHathaway: [C: 03+2] Michael Holloway: remove mail alias and icinga alert [puppet] - 10https://gerrit.wikimedia.org/r/758884 (owner: 10JHathaway) [16:38:31] (03CR) 10Dzahn: [C: 03+2] "yep, that's the IP that I assigned to gitlab-prod-1001 in Horizon" [puppet] - 10https://gerrit.wikimedia.org/r/758793 (https://phabricator.wikimedia.org/T297411) (owner: 10Jelto) [16:39:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 5%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P19851 and previous config saved to /var/cache/conftool/dbconfig/20220201-163958-root.json [16:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:08] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ml-serve2008.mgmt.codfw.wmnet with reboot policy FORCED [16:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:01] (03CR) 10Elukey: [C: 03+2] helmfild.d: add STORAGE_URI setting to the ML draftquality's transformer [deployment-charts] - 10https://gerrit.wikimedia.org/r/758883 (owner: 10Elukey) [16:44:10] (03PS1) 10Andrew Bogott: Revert "Neutron/Victoria/Bullseye: try removing iptables entirely" [puppet] - 10https://gerrit.wikimedia.org/r/758885 [16:44:12] (03PS1) 10Andrew Bogott: cloudnet node: adjust sysctl parameters that are only meant for dataplane [puppet] - 10https://gerrit.wikimedia.org/r/758886 [16:44:22] (03PS1) 10Arturo Borrero Gonzalez: openstack: neutron: l3_agent: some sysctl parameters are only on dataplane 
[puppet] - 10https://gerrit.wikimedia.org/r/758887 [16:44:57] (03CR) 10jerkins-bot: [V: 04-1] Revert "Neutron/Victoria/Bullseye: try removing iptables entirely" [puppet] - 10https://gerrit.wikimedia.org/r/758885 (owner: 10Andrew Bogott) [16:45:12] (03CR) 10jerkins-bot: [V: 04-1] openstack: neutron: l3_agent: some sysctl parameters are only on dataplane [puppet] - 10https://gerrit.wikimedia.org/r/758887 (owner: 10Arturo Borrero Gonzalez) [16:45:19] (03CR) 10jerkins-bot: [V: 04-1] cloudnet node: adjust sysctl parameters that are only meant for dataplane [puppet] - 10https://gerrit.wikimedia.org/r/758886 (owner: 10Andrew Bogott) [16:47:10] (03PS2) 10Arturo Borrero Gonzalez: openstack: neutron: l3_agent: some sysctl parameters are only on dataplane [puppet] - 10https://gerrit.wikimedia.org/r/758887 [16:48:21] (03CR) 10Ebernhardson: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33536/console" [puppet] - 10https://gerrit.wikimedia.org/r/758886 (owner: 10Andrew Bogott) [16:48:22] Pushing Junos image package to the host... complete now Extracting the package ... [16:49:29] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
[16:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:35] (03PS1) 10Dzahn: gitlab: add profile::gitlab::monitoring_whitelist in cloud Hiera [puppet] - 10https://gerrit.wikimedia.org/r/758889 (https://phabricator.wikimedia.org/T297411) [16:49:43] (03CR) 10jerkins-bot: [V: 04-1] openstack: neutron: l3_agent: some sysctl parameters are only on dataplane [puppet] - 10https://gerrit.wikimedia.org/r/758887 (owner: 10Arturo Borrero Gonzalez) [16:49:45] (03Abandoned) 10Andrew Bogott: cloudnet node: adjust sysctl parameters that are only meant for dataplane [puppet] - 10https://gerrit.wikimedia.org/r/758886 (owner: 10Andrew Bogott) [16:50:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve2008.mgmt.codfw.wmnet with reboot policy FORCED [16:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:25] !log rebooting pfw3a-codfw and pfw3b for JUNOS upgrade [16:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:50] (03PS2) 10Dzahn: gitlab: add profile::gitlab::monitoring_whitelist in cloud Hiera [puppet] - 10https://gerrit.wikimedia.org/r/758889 (https://phabricator.wikimedia.org/T297411) [16:51:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P19852 and previous config saved to /var/cache/conftool/dbconfig/20220201-165156-marostegui.json [16:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:38] (03CR) 10Dzahn: [C: 03+2] gitlab: add profile::gitlab::monitoring_whitelist in cloud Hiera [puppet] - 10https://gerrit.wikimedia.org/r/758889 (https://phabricator.wikimedia.org/T297411) (owner: 10Dzahn) [16:53:33] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:53:53] (03PS1) 10Bernard Wang: 
Turn on wgVectorLanguageAlertInSidebar for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758890 [16:55:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 10%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P19854 and previous config saved to /var/cache/conftool/dbconfig/20220201-165501-root.json [16:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:39] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:58:40] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2039.codfw.wmnet with OS buster [16:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:49] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp2039.codfw.wmnet with OS buster completed: - cp2039 (**WARN*... [16:59:25] (03PS1) 10Dzahn: gitlab: parameter for exporters expects Hash but is array by default [puppet] - 10https://gerrit.wikimedia.org/r/758894 (https://phabricator.wikimedia.org/T297411) [17:00:01] (03PS2) 10Arturo Borrero Gonzalez: toolsws: add uwsgi-python3 webservice type [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/758832 (https://phabricator.wikimedia.org/T300501) [17:00:05] jbond and rzl: My dear minions, it's time we take the moon! Just kidding. Time for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220201T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. 
[17:00:53] (03PS2) 10Dzahn: gitlab: parameter for exporters expects Hash but is array by default [puppet] - 10https://gerrit.wikimedia.org/r/758894 (https://phabricator.wikimedia.org/T297411) [17:01:23] (03CR) 10jerkins-bot: [V: 04-1] toolsws: add uwsgi-python3 webservice type [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/758832 (https://phabricator.wikimedia.org/T300501) (owner: 10Arturo Borrero Gonzalez) [17:02:05] (03PS3) 10JMeybohm: Create certificates for different FQDN's in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/758880 (https://phabricator.wikimedia.org/T290966) [17:02:07] (03PS3) 10JMeybohm: Deploy ingress components to staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/758881 (https://phabricator.wikimedia.org/T290966) [17:02:12] (03PS1) 10Majavah: Revert "spec_helper: increase max supported_os to bullseye" [puppet] - 10https://gerrit.wikimedia.org/r/758906 [17:02:48] (03PS2) 10Majavah: Revert "Neutron/Victoria/Bullseye: try removing iptables entirely" [puppet] - 10https://gerrit.wikimedia.org/r/758885 (owner: 10Andrew Bogott) [17:03:50] (03CR) 10jerkins-bot: [V: 04-1] gitlab: parameter for exporters expects Hash but is array by default [puppet] - 10https://gerrit.wikimedia.org/r/758894 (https://phabricator.wikimedia.org/T297411) (owner: 10Dzahn) [17:03:56] (03PS3) 10Arturo Borrero Gonzalez: toolsws: add uwsgi-python3 webservice type [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/758832 (https://phabricator.wikimedia.org/T300501) [17:04:17] (03CR) 10Andrew Bogott: [C: 03+2] Revert "spec_helper: increase max supported_os to bullseye" [puppet] - 10https://gerrit.wikimedia.org/r/758906 (owner: 10Majavah) [17:04:33] (03PS4) 10Arturo Borrero Gonzalez: toolsws: add uwsgi-python3 webservice type [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/758832 (https://phabricator.wikimedia.org/T300501) [17:05:38] (03CR) 10Andrew Bogott: "recheck" [puppet] - 
10https://gerrit.wikimedia.org/r/758887 (owner: 10Arturo Borrero Gonzalez) [17:05:56] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Neutron/Victoria/Bullseye: try removing iptables entirely" [puppet] - 10https://gerrit.wikimedia.org/r/758885 (owner: 10Andrew Bogott) [17:05:58] (03CR) 10Arturo Borrero Gonzalez: [C: 04-2] "On second thoughts, I don't think we should merge this patch. We should favor movement into kubernetes rather than enable new features on " [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/758832 (https://phabricator.wikimedia.org/T300501) (owner: 10Arturo Borrero Gonzalez) [17:06:36] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/758894 (https://phabricator.wikimedia.org/T297411) (owner: 10Dzahn) [17:06:59] (03Abandoned) 10Arturo Borrero Gonzalez: toolsws: add uwsgi-python3 webservice type [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/758832 (https://phabricator.wikimedia.org/T300501) (owner: 10Arturo Borrero Gonzalez) [17:07:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P19855 and previous config saved to /var/cache/conftool/dbconfig/20220201-170701-marostegui.json [17:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:40] (03CR) 10Dzahn: "currently no idea why the jenkins-bot -1 is on here" [puppet] - 10https://gerrit.wikimedia.org/r/758894 (https://phabricator.wikimedia.org/T297411) (owner: 10Dzahn) [17:08:55] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64701/IPv4: Idle - frack-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:09:31] (03CR) 10Dzahn: "well, recheck helped, a bad looking but temp error that is now gone" [puppet] - 10https://gerrit.wikimedia.org/r/758894 (https://phabricator.wikimedia.org/T297411) (owner: 10Dzahn) [17:09:58] (03CR) 10Andrew Bogott: [C: 03+2] openstack: neutron: l3_agent: some sysctl 
parameters are only on dataplane [puppet] - 10https://gerrit.wikimedia.org/r/758887 (owner: 10Arturo Borrero Gonzalez) [17:10:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 20%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P19856 and previous config saved to /var/cache/conftool/dbconfig/20220201-171005-root.json [17:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:26] (03PS2) 10Bernard Wang: Turn on wgVectorLanguageAlertInSidebar for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758890 [17:13:49] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:13:56] (03PS3) 10Bernard Wang: Turn on wgVectorLanguageAlertInSidebar for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758890 (https://phabricator.wikimedia.org/T300559) [17:17:59] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2017.codfw.wmnet with OS buster [17:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:51] PROBLEM - BGP status on pfw3-eqiad is CRITICAL: BGP CRITICAL - AS64701/IPv4: Idle - frack-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:19:46] 10SRE-OnFire-Incident-Docs: 20200625-caching-sessions - https://phabricator.wikimedia.org/T260101 (10lmata) 05Open→03Declined [17:19:59] 10SRE-OnFire-Incident-Docs: 20200605-cloud-private-repo - https://phabricator.wikimedia.org/T254750 (10lmata) 05Open→03Declined [17:20:12] 10SRE-OnFire-Incident-Docs: 20200907-mobilefrontend-sec - https://phabricator.wikimedia.org/T268487 (10lmata) 05Open→03Declined [17:20:24] 10SRE-OnFire-Incident-Docs: 20200127-app server latency - https://phabricator.wikimedia.org/T251325 (10lmata) 05Open→03Declined [17:20:44] 
10SRE-OnFire-Incident-Docs: 20201006-cloud-vps - https://phabricator.wikimedia.org/T268482 (10lmata) 05Open→03Declined [17:20:46] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM." [homer/public] - 10https://gerrit.wikimedia.org/r/758470 (owner: 10Ayounsi) [17:20:55] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64701/IPv4: Idle - frack-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:21:05] 10SRE-OnFire-Incident-Docs: 20200812-appservers oom - https://phabricator.wikimedia.org/T268481 (10lmata) 05Open→03Declined [17:21:19] 10SRE-OnFire-Incident-Docs: 20200909-mobileapps config change - https://phabricator.wikimedia.org/T268484 (10lmata) 05Open→03Declined [17:21:37] 10SRE-OnFire-Incident-Docs: 20200814-isp-unreachable - https://phabricator.wikimedia.org/T268480 (10lmata) 05Open→03Declined [17:21:48] !log pool cp2039 running envoy as TLS terminator - T271421 [17:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:52] T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 [17:22:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T298558)', diff saved to https://phabricator.wikimedia.org/P19857 and previous config saved to /var/cache/conftool/dbconfig/20220201-172205-marostegui.json [17:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [17:22:09] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [17:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [17:22:11] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T298558)', diff saved to https://phabricator.wikimedia.org/P19858 and previous config saved to /var/cache/conftool/dbconfig/20220201-172219-marostegui.json [17:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:39] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10Vgutierrez) [17:23:20] (03CR) 10Cathal Mooney: [C: 03+1] "I didn't check every single term to validate it all but did some sanity checks on a few of them and reviewed overall config. LGTM, nice t" [homer/public] - 10https://gerrit.wikimedia.org/r/748111 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [17:23:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T298558)', diff saved to https://phabricator.wikimedia.org/P19859 and previous config saved to /var/cache/conftool/dbconfig/20220201-172325-marostegui.json [17:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:32] 10SRE-OnFire-Incident-Docs: 20200204-maps - https://phabricator.wikimedia.org/T251340 (10lmata) 05Open→03Declined closing as this will not get addressed. 
[17:23:55] (03PS2) 10Hashar: ci: Qemu image and snapshot creation [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) [17:24:02] 10SRE-OnFire-Incident-Docs: 20200511-thumbor - https://phabricator.wikimedia.org/T254748 (10lmata) 05Open→03Declined closing as this will not get addressed. [17:24:06] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (10Papaul) [17:24:20] (03CR) 10Hashar: [V: 04-1] ci: Qemu image and snapshot creation [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [17:24:33] (03CR) 10jerkins-bot: [V: 04-1] ci: Qemu image and snapshot creation [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [17:24:42] 10SRE-OnFire-Incident-Docs: 20200407-Wikidata's wb_items_per_site table dropped - https://phabricator.wikimedia.org/T251338 (10lmata) 05Open→03Declined closing as this will not get addressed. [17:24:52] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (10Papaul) [17:25:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 25%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P19860 and previous config saved to /var/cache/conftool/dbconfig/20220201-172509-root.json [17:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:43] 10SRE-OnFire-Incident-Docs: 20200505-wdqs-deploy - https://phabricator.wikimedia.org/T252403 (10lmata) 05Open→03Declined closing as this will not get addressed. [17:25:59] 10SRE-OnFire-Incident-Docs: 20200325-codfw-network - https://phabricator.wikimedia.org/T251337 (10lmata) 05Open→03Declined closing as this will not get addressed. 
[17:26:13] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet2004-dev.codfw.wmnet with OS bullseye [17:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:24] 10SRE-OnFire-Incident-Docs: 20200319-parsercache - https://phabricator.wikimedia.org/T251336 (10lmata) 05Open→03Declined closing as this will not get addressed. [17:27:07] 10SRE-OnFire-Incident-Docs: 20200225-mediawiki interface language - https://phabricator.wikimedia.org/T251334 (10lmata) 05Open→03Declined closing as this will not get addressed. [17:27:12] (03PS3) 10Hashar: ci: Qemu image and snapshot creation [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) [17:27:50] 10SRE-OnFire-Incident-Docs: 20200206-mediawiki - https://phabricator.wikimedia.org/T251328 (10lmata) 05Open→03Declined closing as this will not get addressed. [17:27:51] (03CR) 10jerkins-bot: [V: 04-1] ci: Qemu image and snapshot creation [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [17:28:16] 10SRE-OnFire-Incident-Docs: 20200204-app server latency - https://phabricator.wikimedia.org/T251326 (10lmata) 05Open→03Declined closing as this will not get addressed. 
[17:29:07] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:29:21] 10SRE-OnFire-Incident-Docs: Primary s4 db Incident report review - https://phabricator.wikimedia.org/T281263 (10lmata) 05Open→03Declined closing as this will not get addressed further [17:29:44] !log about to deploy analytics/refinery [17:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:55] RECOVERY - BGP status on pfw3-eqiad is OK: BGP OK - up: 5, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:30:17] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:30:25] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 80, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:30:31] !log btullis@deploy1002 Started deploy [analytics/refinery@c24f002]: Regular analytics weekly train [analytics/refinery@c24f002] [17:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:53] 10SRE, 10SRE-OnFire-Incident-Docs, 10serviceops, 10Sustainability (Incident Followup): High latency on appservers - https://phabricator.wikimedia.org/T272215 (10lmata) 05Open→03Declined closing this documentation task as it is unlikely the documentation will be completed further [17:31:59] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 113, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:32:04] 10SRE-OnFire-Incident-Docs: 20200207-wikidata - https://phabricator.wikimedia.org/T251332 (10lmata) 05Open→03Declined closing this documentation task as it is unlikely the documentation will be completed 
further [17:32:24] 10SRE-OnFire-Incident-Docs: 20200902-wdqs-outage - https://phabricator.wikimedia.org/T268488 (10lmata) 05Open→03Declined closing this documentation task as it is unlikely the documentation will be completed further [17:32:39] 10SRE-OnFire-Incident-Docs: 20200908-Rack D hosts outage - https://phabricator.wikimedia.org/T268485 (10lmata) 05Open→03Declined closing this documentation task as it is unlikely the documentation will be completed further [17:32:53] 10SRE-OnFire-Incident-Docs: 20200925-s5-replication-lag - https://phabricator.wikimedia.org/T268483 (10lmata) 05Open→03Declined closing this documentation task as it is unlikely the documentation will be completed further [17:33:53] 10SRE-OnFire-Incident-Docs: 20200723-wdqs-outage - https://phabricator.wikimedia.org/T268479 (10lmata) 05Open→03Declined closing this documentation task as it is unlikely the documentation will be completed further [17:34:01] (03CR) 10Jdlrobson: [C: 03+1] "Note: Some of the bug fixes are still in the train, so make sure you don't backport this any earlier than 2nd Feb PM window. Might be wort" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758890 (https://phabricator.wikimedia.org/T300559) (owner: 10Bernard Wang) [17:35:06] 10SRE-OnFire (FY2021/2022-Q2): 2021-11-04 large file upload timeouts - https://phabricator.wikimedia.org/T299965 (10lmata) [17:35:18] 10SRE-OnFire (FY2021/2022-Q2): 2021-11-04 large file upload timeouts - https://phabricator.wikimedia.org/T299965 (10lmata) a:03lmata [17:36:07] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10netops: Upgrade pfw to Junos 20+ - https://phabricator.wikimedia.org/T295691 (10Papaul) ` request system software add no-copy unlink ` started at 10:38am CT and complete at 10.52am CT ~ 13mins ` request system reboot ` started... 
[17:38:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P19861 and previous config saved to /var/cache/conftool/dbconfig/20220201-173830-marostegui.json [17:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:13] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10netops: Upgrade pfw to Junos 20+ - https://phabricator.wikimedia.org/T295691 (10Papaul) 05Open→03Resolved a:03Papaul Complete [17:40:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 40%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P19862 and previous config saved to /var/cache/conftool/dbconfig/20220201-174012-root.json [17:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:01] !log btullis@deploy1002 Finished deploy [analytics/refinery@c24f002]: Regular analytics weekly train [analytics/refinery@c24f002] (duration: 11m 29s) [17:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:23] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10netops: Upgrade pfw to Junos 20+ - https://phabricator.wikimedia.org/T295691 (10Papaul) 05Resolved→03Open [17:43:40] (03PS1) 10DCausse: eventgate-main: update image to 2022-02-01-141357-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/758926 (https://phabricator.wikimedia.org/T279541) [17:44:50] (03PS1) 10JHathaway: icinga mobileapps: replace bearnd with mbsantos & jgiannelos [puppet] - 10https://gerrit.wikimedia.org/r/758927 [17:46:17] (03CR) 10MSantos: [C: 03+1] icinga mobileapps: replace bearnd with mbsantos & jgiannelos [puppet] - 10https://gerrit.wikimedia.org/r/758927 (owner: 10JHathaway) [17:47:35] (03PS1) 10Papaul: Add ml-staging200[12] to site.pp with role insetup [puppet] - 10https://gerrit.wikimedia.org/r/758928 (https://phabricator.wikimedia.org/T294946) 
[17:47:38] (03CR) 10JHathaway: [C: 03+2] icinga mobileapps: replace bearnd with mbsantos & jgiannelos [puppet] - 10https://gerrit.wikimedia.org/r/758927 (owner: 10JHathaway) [17:47:52] !log begin logstash upgrade (eqiad) T299168 [17:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:55] T299168: Upgrade OpenSearch - https://phabricator.wikimedia.org/T299168 [17:48:29] (03PS1) 10Urbanecm: amiwiki: Deploy Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758929 [17:48:40] !log btullis@deploy1002 Started deploy [analytics/refinery@c24f002] (thin): Regular analytics weekly train THIN [analytics/refinery@c24f002] [17:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:41] jouncebot: nowandnext [17:48:41] For the next 0 hour(s) and 11 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220201T1700) [17:48:41] In 0 hour(s) and 11 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220201T1800) [17:48:47] !log btullis@deploy1002 Finished deploy [analytics/refinery@c24f002] (thin): Regular analytics weekly train THIN [analytics/refinery@c24f002] (duration: 00m 07s) [17:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:03] (03PS2) 10Urbanecm: amiwiki: Deploy Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758929 [17:49:03] !log btullis@deploy1002 Started deploy [analytics/refinery@c24f002] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@c24f002] [17:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:09] (03CR) 10Urbanecm: [C: 03+2] amiwiki: Deploy Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758929 (owner: 10Urbanecm) [17:49:46] ottomata: I'm about to refresh eventgate-main with 
https://gerrit.wikimedia.org/r/758926 if you don't have objections [17:49:52] (03Merged) 10jenkins-bot: amiwiki: Deploy Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758929 (owner: 10Urbanecm) [17:50:07] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php amiwiki growthexperiments [17:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:48] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [17:52:10] (03CR) 10Papaul: [C: 03+2] Add ml-staging200[12] to site.pp with role insetup [puppet] - 10https://gerrit.wikimedia.org/r/758928 (https://phabricator.wikimedia.org/T294946) (owner: 10Papaul) [17:52:32] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/initWikiConfig.php amiwiki [17:52:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [17:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:30] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10netops: Upgrade pfw to Junos 20+ - https://phabricator.wikimedia.org/T295691 (10Jgreen) [17:53:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P19863 and previous config saved to /var/cache/conftool/dbconfig/20220201-175334-marostegui.json [17:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:33] !log 
mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [17:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [17:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:44] !log btullis@deploy1002 Finished deploy [analytics/refinery@c24f002] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@c24f002] (duration: 05m 41s) [17:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:51] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10netops: Upgrade pfw to Junos 20+ - https://phabricator.wikimedia.org/T295691 (10Jgreen) We're planning to do the upgrade during the Fundraising planned maintenance window of 5/16/2022 to 5/20/2022. [17:55:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 50%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P19864 and previous config saved to /var/cache/conftool/dbconfig/20220201-175516-root.json [17:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:56] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 7f8bc6df1ca0856016cd08156654dcb4e388898f: amiwiki: Deploy Growth features in dark mode (1/3) (duration: 00m 51s) [17:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:47] !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: 7f8bc6df1ca0856016cd08156654dcb4e388898f: amiwiki: Deploy Growth features in dark mode (2/3) (duration: 00m 50s) [17:56:47] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet2004-dev.codfw.wmnet with OS bullseye [17:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:49] Logged
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [17:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:37] !log urbanecm@deploy1002 Synchronized wmf-config/config/amiwiki.yaml: 7f8bc6df1ca0856016cd08156654dcb4e388898f: amiwiki: Deploy Growth features in dark mode (3/3) (duration: 00m 49s) [17:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ml-staging2001.codfw.wmnet with OS buster [17:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:58] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-staging2001.codfw.wmnet with O... [17:59:28] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [18:00:05] chrisalbon and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220201T1800). Please do the needful. 
[18:03:06] * urbanecm done with deployment [18:03:45] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2017.codfw.wmnet with OS buster [18:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:08] 10SRE-Access-Requests: Requesting access to deploy-phabricator for brennen - https://phabricator.wikimedia.org/T300658 (10brennen) [18:04:40] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2017.wmnet [18:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:41] 10SRE-Access-Requests, 10Release-Engineering-Team (Radar), 10User-brennen: Requesting access to deploy-phabricator for brennen - https://phabricator.wikimedia.org/T300658 (10brennen) [18:05:22] 10SRE-OnFire (FY2021/2022-Q3): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10herron) Q2 incident docs on wikitech have been updated to include the summary and scorecard sections from the current incident template [18:08:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T298558)', diff saved to https://phabricator.wikimedia.org/P19865 and previous config saved to /var/cache/conftool/dbconfig/20220201-180839-marostegui.json [18:08:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [18:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [18:08:43] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [18:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1160 (T298558)', diff saved to https://phabricator.wikimedia.org/P19866 and previous config saved to /var/cache/conftool/dbconfig/20220201-180847-marostegui.json [18:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T298558)', diff saved to https://phabricator.wikimedia.org/P19867 and previous config saved to /var/cache/conftool/dbconfig/20220201-180953-marostegui.json [18:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:07] !log end logstash upgrade (eqiad) T299168 [18:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:10] T299168: Upgrade OpenSearch - https://phabricator.wikimedia.org/T299168 [18:10:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 60%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P19868 and previous config saved to /var/cache/conftool/dbconfig/20220201-181019-root.json [18:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:30] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [18:12:58] 10SRE, 10ops-eqiad: New Cage Config/Testing Eqiad - https://phabricator.wikimedia.org/T300353 (10Cmjohnson) @cmooney a temporary fiber has been run for lsw-e1 xe-0/0/0 -> lsw-e2 xe-0/0/0 [18:13:17] (03PS1) 10Urbanecm: amiwiki: Deploy Growth features to newcomers [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/758930 [18:15:46] 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 10serviceops, and 2 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (10Legoktm) >>! In T292322#7666754, @Joe wrote: > I think this is still way too slow compared to the traditional request time. > > I'll update this task lat... [18:15:56] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-staging2001.codfw.wmnet with OS buster [18:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:05] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-staging2001.codfw.wmnet with OS buster executed with err... [18:16:40] (03PS6) 10JHathaway: [WIP] team-sre: add hardware-related checks [alerts] - 10https://gerrit.wikimedia.org/r/757489 (https://phabricator.wikimedia.org/T294564) (owner: 10Volans) [18:17:54] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [18:18:13] (03PS2) 10DCausse: eventgate-main: update image to 2022-02-01-141357-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/758926 (https://phabricator.wikimedia.org/T279541) [18:19:52] PROBLEM - Host ganeti1019 is DOWN: PING CRITICAL - Packet loss = 100% [18:20:12] going to ship ^ (schema secondary repo bump), let me know if someone has objections [18:21:10] 10SRE-Access-Requests, 10Release-Engineering-Team (Radar), 10User-brennen: Requesting access to 
deploy-phabricator for brennen - https://phabricator.wikimedia.org/T300658 (10thcipriani) [18:22:11] 10SRE-Access-Requests, 10Release-Engineering-Team (Radar), 10User-brennen: Requesting access to deploy-phabricator for brennen - https://phabricator.wikimedia.org/T300658 (10thcipriani) Approved as sponsor/manager. There is no official approver in data.yaml; however, I probably am the right person to be th... [18:23:45] (03CR) 10DCausse: [C: 03+2] eventgate-main: update image to 2022-02-01-141357-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/758926 (https://phabricator.wikimedia.org/T279541) (owner: 10DCausse) [18:24:28] RECOVERY - Host ganeti1019 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [18:24:30] PROBLEM - Check systemd state on ganeti1019 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:24:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P19869 and previous config saved to /var/cache/conftool/dbconfig/20220201-182458-marostegui.json [18:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:01] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ml-staging2001.codfw.wmnet with OS buster [18:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:10] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-staging2001.codfw.wmnet with OS buster [18:25:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 75%: repooling after schema change', diff saved to
https://phabricator.wikimedia.org/P19870 and previous config saved to /var/cache/conftool/dbconfig/20220201-182523-root.json [18:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:50] (03PS1) 10Dzahn: admin: add brennen to deploy-phabricator and Tyler as approver [puppet] - 10https://gerrit.wikimedia.org/r/758935 (https://phabricator.wikimedia.org/T300658) [18:26:36] RECOVERY - Check systemd state on ganeti1019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:27:29] (03Merged) 10jenkins-bot: eventgate-main: update image to 2022-02-01-141357-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/758926 (https://phabricator.wikimedia.org/T279541) (owner: 10DCausse) [18:29:03] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply on production [18:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:05] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply on canary [18:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:21] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: sync on production [18:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:28] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missi [18:30:28] [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [18:30:41] 10SRE, 10ops-eqiad: New Cage Config/Testing Eqiad - 
https://phabricator.wikimedia.org/T300353 (10cmooney) @Cmjohnson super! Perfect timing too I'm just at the stage I need it. Showing up/up all looking good. [18:33:28] !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply on production [18:33:28] !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply on canary [18:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:57] PROBLEM - Host ganeti1019 is DOWN: PING CRITICAL - Packet loss = 100% [18:35:47] !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync on canary [18:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:04] !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync on production [18:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:06] !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply on production [18:38:06] !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply on canary [18:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:18] (03PS1) 10Accraze: ml-services: update draftquality transformer image [deployment-charts] - 10https://gerrit.wikimedia.org/r/758942 (https://phabricator.wikimedia.org/T298989) [18:38:57] !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync on canary [18:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P19871 and previous config saved to 
/var/cache/conftool/dbconfig/20220201-184002-marostegui.json [18:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:27] 10SRE-Access-Requests: Requesting access to MediaWiki deployment shell for bwang - https://phabricator.wikimedia.org/T300664 (10bwang) [18:40:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 100%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P19872 and previous config saved to /var/cache/conftool/dbconfig/20220201-184027-root.json [18:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:31] !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync on production [18:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:12] 10SRE-Access-Requests: Requesting access to MediaWiki deployment shell for bwang - https://phabricator.wikimedia.org/T300664 (10SCherukuwada) Manager approves. [18:44:01] RECOVERY - Host ganeti1019 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [18:44:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-staging2001.codfw.wmnet with OS buster [18:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:16] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-staging2001.codfw.wmnet with OS buster completed: - ml-s... 
[18:45:55] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ml-staging2002.codfw.wmnet with OS buster [18:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:03] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-staging2002.codfw.wmnet with OS buster [18:47:31] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10Cmjohnson) [18:47:59] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10Cmjohnson) @MoritzMuehlenhoff 1006, 1016 and 1019 updated [18:48:11] PROBLEM - Disk space on centrallog1001 is CRITICAL: DISK CRITICAL - free space: /srv 34507 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=centrallog1001&var-datasource=eqiad+prometheus/ops [18:48:53] (03PS2) 10Majavah: openstack: tell dynamicproxy to not edit dns records [puppet] - 10https://gerrit.wikimedia.org/r/756117 (https://phabricator.wikimedia.org/T295246) [18:50:17] PROBLEM - Check systemd state on prometheus1003 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-node-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:50:48] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q3:(Need By: TBD) rack/setup/install ganeti2029.codfw.wmnet, ganeti2030.codfw.wmnet - https://phabricator.wikimedia.org/T298998 (10Papaul) [18:51:18] herron: centrallog now 99% instead of 97% ..even though you merged a change to remove old logs? 
[18:52:14] mutante: I'm seeing 97 still [18:52:22] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (10Papaul) [18:52:42] herron: my bad, it's inodes=99% [18:52:52] mutante: ahh gotcha [18:53:12] still some pruning needed though, may need to start attacking the top producers [18:53:18] well, if you still have those 30G or something then it seems fine.. until that systemd timer will eventually kick in.. right [18:54:22] you could manually start the timer job that puppet created now if needed [18:55:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T298558)', diff saved to https://phabricator.wikimedia.org/P19873 and previous config saved to /var/cache/conftool/dbconfig/20220201-185507-marostegui.json [18:55:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [18:55:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [18:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:14] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [18:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T298558)', diff saved to https://phabricator.wikimedia.org/P19874 and previous config saved to /var/cache/conftool/dbconfig/20220201-185516-marostegui.json [18:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T298558)', diff saved to 
https://phabricator.wikimedia.org/P19875 and previous config saved to /var/cache/conftool/dbconfig/20220201-185622-marostegui.json [18:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:35] mutante: I'll lower the grace period too [19:00:05] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220201T1900) [19:01:24] herron: sounds good! [19:01:43] 10SRE, 10Icinga: Request downtime hosts and servies privileges in Icinga - https://phabricator.wikimedia.org/T300660 (10Aklapper) [19:01:47] (03PS1) 10Herron: rsyslog: lower centrallog1001 old log deletion grace period to 15 days [puppet] - 10https://gerrit.wikimedia.org/r/758945 (https://phabricator.wikimedia.org/T300056) [19:02:57] !log joal@deploy1002 Started deploy [analytics/refinery@6a7983e]: Hotfix analytics weekly train [analytics/refinery@6a7983e] [19:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:20] Starting train prep [19:05:12] 10SRE, 10Icinga: Request downtime hosts and servies privileges in Icinga - https://phabricator.wikimedia.org/T300660 (10ayounsi) Adding Amir to the task as he is clinic duty. [19:05:43] errors look pretty noisy from the start. 
[19:05:44] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/pcc-worker1002/33538/centrallog1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/758945 (https://phabricator.wikimedia.org/T300056) (owner: 10Herron) [19:10:46] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/758558 (owner: 10Volans) [19:10:53] (03CR) 10Herron: [C: 03+2] rsyslog: lower centrallog1001 old log deletion grace period to 15 days [puppet] - 10https://gerrit.wikimedia.org/r/758945 (https://phabricator.wikimedia.org/T300056) (owner: 10Herron) [19:11:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P19876 and previous config saved to /var/cache/conftool/dbconfig/20220201-191127-marostegui.json [19:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:54] 10SRE, 10ops-eqiad: New Cage Config/Testing Eqiad - https://phabricator.wikimedia.org/T300353 (10Cmjohnson) @cmooney the 2 servers I am racking are an-workers, they will be in the analytics vlan. Does that matter to you? [19:18:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:12] 10SRE, 10ops-eqiad: New Cage Config/Testing Eqiad - https://phabricator.wikimedia.org/T300353 (10cmooney) @Cmjohnson fire away. I can manually move the Vlan if needed. BTW is there a task open for new an-worker servers? There are some questions about the Analytics Vlan and how it'll work with the new setup... 
[19:19:16] 10SRE, 10ops-eqiad: New Cage Config/Testing Eqiad - https://phabricator.wikimedia.org/T300353 (10Cmjohnson) I see, netbox will not allow me to assign analytics as vlan for that switch [19:19:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-staging2002.codfw.wmnet with OS buster [19:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:49] (03PS1) 10Ahmon Dancy: testwikis wikis to 1.38.0-wmf.20 refs T293961 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758947 [19:19:52] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-staging2002.codfw.wmnet with OS buster completed: - ml-s... 
[19:19:55] (03CR) 10Ahmon Dancy: [C: 03+2] testwikis wikis to 1.38.0-wmf.20 refs T293961 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758947 (owner: 10Ahmon Dancy) [19:20:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:31] (03Merged) 10jenkins-bot: testwikis wikis to 1.38.0-wmf.20 refs T293961 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758947 (owner: 10Ahmon Dancy) [19:20:32] !log dancy@deploy1002 Started scap: testwikis wikis to 1.38.0-wmf.20 refs T293961 [19:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:35] T293961: 1.38.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T293961 [19:22:07] !log joal@deploy1002 Finished deploy [analytics/refinery@6a7983e]: Hotfix analytics weekly train [analytics/refinery@6a7983e] (duration: 19m 09s) [19:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P19877 and previous config saved to /var/cache/conftool/dbconfig/20220201-192632-marostegui.json [19:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:57] 
!log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:15] (03CR) 10Dzahn: [C: 03+1] rsyslog: lower centrallog1001 old log deletion grace period to 15 days [puppet] - 10https://gerrit.wikimedia.org/r/758945 (https://phabricator.wikimedia.org/T300056) (owner: 10Herron) [19:28:39] (03CR) 10Ladsgroup: [C: 03+1] prod: READ_NEW for CentralAuth hidden level migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758443 (https://phabricator.wikimedia.org/T289068) (owner: 10Majavah) [19:29:07] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [19:29:36] hmmm [19:29:55] sync-masters is happening right now. [19:30:04] drwx--S--- 2 root wikidev 4096 Feb 1 19:25 .~tmp~ [19:30:09] ah, and the rsync runs as root.. [19:30:21] so it should self-clear at the end of the run when it fixes up ownerships [19:30:28] 10SRE, 10Icinga, 10User-Ladsgroup: Request downtime hosts and servies privileges in Icinga - https://phabricator.wikimedia.org/T300660 (10Ladsgroup) a:03Ladsgroup Sorry, I will look at this asap. I don't know how I should have seen it (in which dashboard) but this is my first time being clinic duty. [19:35:35] 10SRE, 10ops-codfw: Possible cable issue on restbase2010 management interface - https://phabricator.wikimedia.org/T299426 (10Dzahn) It's highly likely this is like T283582 . We have had this on multiple hosts and they were indeed fixed by firmware upgrade.
[19:38:15] (03CR) 10Herron: [C: 03+1] sre: check agent resources too in PuppetFailure [alerts] - 10https://gerrit.wikimedia.org/r/758860 (https://phabricator.wikimedia.org/T299628) (owner: 10Filippo Giunchedi) [19:39:58] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [19:40:40] !log joal@deploy1002 Started deploy [analytics/refinery@6a7983e] (thin): Hotfix analytics weekly train THIN [analytics/refinery@6a7983e] [19:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:47] !log joal@deploy1002 Finished deploy [analytics/refinery@6a7983e] (thin): Hotfix analytics weekly train THIN [analytics/refinery@6a7983e] (duration: 00m 07s) [19:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T298558)', diff saved to https://phabricator.wikimedia.org/P19878 and previous config saved to /var/cache/conftool/dbconfig/20220201-194136-marostegui.json [19:41:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [19:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [19:41:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T298558)', diff saved to https://phabricator.wikimedia.org/P19879 and previous config saved to /var/cache/conftool/dbconfig/20220201-194144-marostegui.json [19:41:45] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [19:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T298558)', diff saved to https://phabricator.wikimedia.org/P19880 and previous config saved to /var/cache/conftool/dbconfig/20220201-194250-marostegui.json [19:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:37] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (10Papaul) [19:47:41] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (10Papaul) @elukey all yours leaving the task open since i don't have the Packing Slip to receive the servers in Coupa [19:48:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:18] !log joal@deploy1002 Started deploy [analytics/refinery@6a7983e] (hadoop-test): Hotfix analytics weekly train TEST [analytics/refinery@6a7983e] [19:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:10] !log joal@deploy1002 Finished deploy [analytics/refinery@6a7983e] (hadoop-test): Hotfix analytics weekly train TEST [analytics/refinery@6a7983e] (duration: 05m 51s) [19:55:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:48] 10SRE, 10DNS, 10Domains, 10Traffic, and 3 others: Project Unseen campaign URL redirect - https://phabricator.wikimedia.org/T300398 (10Varnent) Thank you so very much @Ladsgroup!! :) [19:56:05] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [19:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P19881 and previous config saved to /var/cache/conftool/dbconfig/20220201-195755-marostegui.json [19:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] dancy and brennen: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220201T2000). [20:00:23] Rolling out to testwikis now. Will do group0 in about 30 mins [20:00:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:12] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10Cmjohnson) @fgiunchedi I racked ms-fe1012 in the new cage e1. I believe it's going to be used to test the network in the cage for a little bit. Afterward do... 
[20:03:37] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [20:05:25] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:15] !log dancy@deploy1002 Finished scap: testwikis wikis to 1.38.0-wmf.20 refs T293961 (duration: 51m 42s) [20:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:18] T293961: 1.38.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T293961 [20:13:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P19882 and previous config saved to /var/cache/conftool/dbconfig/20220201-201259-marostegui.json [20:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:14] 10SRE, 10ops-eqiad: New Cage Config/Testing Eqiad - https://phabricator.wikimedia.org/T300353 (10Cmjohnson) I racked ms-fe1012 in e1 and connected lsw1-e1-eqiad port xe-1/0/7. Updated netbox and ran the provision script without any issues. idrac is setup and reachable. 
[20:18:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: PS Redundancy for elastic1080.eqiad.wmnet - https://phabricator.wikimedia.org/T300317 (10Cmjohnson) 05Open→03Resolved loose power cables, issue resolved [20:19:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: PS Redundancy for elastic1077.eqiad.wmnet - https://phabricator.wikimedia.org/T300315 (10Cmjohnson) 05Open→03Resolved loose power cables, issue resolved [20:21:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:59] !log dancy@deploy1002 Pruned MediaWiki: 1.38.0-wmf.18 (duration: 04m 08s) [20:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:27:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T298558)', diff saved to https://phabricator.wikimedia.org/P19884 and previous config saved to /var/cache/conftool/dbconfig/20220201-202806-marostegui.json [20:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:09] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [20:28:20] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (10Papaul) [20:33:52] okay. [20:33:57] are we ready to do this thing? 
[20:34:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:19] let's do it [20:35:09] * dancy presses buttons [20:36:14] (03PS1) 10Ahmon Dancy: group0 wikis to 1.38.0-wmf.20 refs T293961 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758958 [20:36:16] (03CR) 10Ahmon Dancy: [C: 03+2] group0 wikis to 1.38.0-wmf.20 refs T293961 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758958 (owner: 10Ahmon Dancy) [20:37:01] (03Merged) 10jenkins-bot: group0 wikis to 1.38.0-wmf.20 refs T293961 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758958 (owner: 10Ahmon Dancy) [20:38:23] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.38.0-wmf.20 refs T293961 [20:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:26] T293961: 1.38.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T293961 [20:39:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:59] !log dancy@deploy1002 Pruned MediaWiki: 1.38.0-wmf.17 (duration: 01m 35s) [20:43:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:07] (03PS1) 10Ahmon Dancy: logspam: Read log files more efficiently [puppet] - 10https://gerrit.wikimedia.org/r/758962 [20:58:42] (03PS2) 10Ahmon Dancy: logspam: Read log files more efficiently [puppet] - 10https://gerrit.wikimedia.org/r/758962 [21:01:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [21:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:19] (03PS1) 10Aklapper: Automate checking MFA status of elevated Phabricator accounts [puppet] - 10https://gerrit.wikimedia.org/r/758963 (https://phabricator.wikimedia.org/T299403) [21:08:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [21:09:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [21:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [21:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:06] (03PS2) 10Aklapper: Regularly check MFA status of elevated Phabricator accounts [puppet] - 10https://gerrit.wikimedia.org/r/758963 (https://phabricator.wikimedia.org/T299403) [21:12:57] 10SRE, 10Icinga, 10User-Ladsgroup: Request downtime hosts and servies privileges in Icinga - https://phabricator.wikimedia.org/T300660 (10Ladsgroup) Hi @Papaul since you have root, you can do it yourself with the steps given in T300660#7668752. Do you still feel I should do it? 
[21:13:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [21:14:24] log Deployed patch for T297754 [21:14:34] (03PS1) 10Andrew Bogott: codfw1dev network tests: use cloudservices2002-dev for dns [puppet] - 10https://gerrit.wikimedia.org/r/758964 [21:14:34] !log Deployed patch for T297754 [21:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:09] (03CR) 10Dzahn: Regularly check MFA status of elevated Phabricator accounts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758963 (https://phabricator.wikimedia.org/T299403) (owner: 10Aklapper) [21:15:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [21:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:05] (03PS2) 10Ladsgroup: admin: add brennen to deploy-phabricator and Tyler as approver [puppet] - 10https://gerrit.wikimedia.org/r/758935 (https://phabricator.wikimedia.org/T300658) (owner: 10Dzahn) [21:16:19] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev network tests: use cloudservices2002-dev for dns [puppet] - 10https://gerrit.wikimedia.org/r/758964 (owner: 10Andrew Bogott) [21:17:34] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] admin: add brennen to deploy-phabricator and Tyler as approver [puppet] - 10https://gerrit.wikimedia.org/r/758935 (https://phabricator.wikimedia.org/T300658) (owner: 10Dzahn) [21:19:00] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Radar), and 2 others: Requesting access to deploy-phabricator for brennen - https://phabricator.wikimedia.org/T300658 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup [21:19:08] (03CR) 10Aklapper: Regularly check MFA status of elevated Phabricator accounts (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/758963 (https://phabricator.wikimedia.org/T299403) (owner: 10Aklapper) [21:20:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [21:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:58] 10SRE, 10SRE-Access-Requests, 10User-Ladsgroup: Requesting access to MediaWiki deployment shell for bwang - https://phabricator.wikimedia.org/T300664 (10Ladsgroup) a:03Ladsgroup I will get this done as part of my SRE clinic duty duties. [21:25:35] (03PS3) 10Jforrester: TimedMediaHandler: Make videojs the only player on all group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612348 (https://phabricator.wikimedia.org/T248418) [21:26:07] (03PS3) 10Jforrester: TimedMediaHandler: Make videojs the only player everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612349 (https://phabricator.wikimedia.org/T248418) [21:26:12] (03PS1) 10Ahmon Dancy: fix logspam-watch: sorting by column 6 is broken [puppet] - 10https://gerrit.wikimedia.org/r/758965 (https://phabricator.wikimedia.org/T300298) [21:26:16] (03PS3) 10Jforrester: TimedMediaHandler: Drop Beta Feature, no longer usable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612350 (https://phabricator.wikimedia.org/T248418) [21:26:29] (03PS3) 10Jforrester: TimedMediaHandler: Don't read wmgTmhWebPlayer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612351 (https://phabricator.wikimedia.org/T248418) [21:26:55] (03PS3) 10Jforrester: TimedMediaHandler: Drop pre-switch config, no longer read [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612352 (https://phabricator.wikimedia.org/T248418) [21:27:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [21:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply 
on pinkunicorn [21:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [21:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:02] (03PS1) 10Cmjohnson: Adding new ms-fe1090[9]|1[0-2]) to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/758967 (https://phabricator.wikimedia.org/T294137) [21:34:47] (03PS2) 10Cmjohnson: Adding new ms-fe1090[9]|1[0-2]) to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/758967 (https://phabricator.wikimedia.org/T294137) [21:35:23] (03PS1) 10Papaul: Add ml-server200[5-6] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/758968 (https://phabricator.wikimedia.org/T294945) [21:36:49] 10SRE, 10SRE-Access-Requests, 10User-Ladsgroup: Requesting access to MediaWiki deployment shell for bwang - https://phabricator.wikimedia.org/T300664 (10Ladsgroup) [21:37:08] (03CR) 10Cmjohnson: [C: 03+2] Adding new ms-fe1090[9]|1[0-2]) to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/758967 (https://phabricator.wikimedia.org/T294137) (owner: 10Cmjohnson) [21:39:33] (03CR) 10Brennen Bearnes: [C: 03+1] "Code looks good, and after about an hour's testing it seems stable and is a huge speed / usability improvement." [puppet] - 10https://gerrit.wikimedia.org/r/758962 (owner: 10Ahmon Dancy) [21:40:14] 10SRE, 10SRE-Access-Requests, 10User-Ladsgroup: Requesting access to MediaWiki deployment shell for bwang - https://phabricator.wikimedia.org/T300664 (10Ladsgroup) There is no explicit approval for deployment group but IIRC this needs to be approved by @thcipriani. Am I wrong? Isn't it needed anymore? 
[21:40:45] (03PS3) 10Ahmon Dancy: logspam: Read log files more efficiently [puppet] - 10https://gerrit.wikimedia.org/r/758962 [21:41:04] (03CR) 10Ahmon Dancy: logspam: Read log files more efficiently (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758962 (owner: 10Ahmon Dancy) [21:41:53] (03CR) 10Brennen Bearnes: [C: 03+1] gitlab: parameter for exporters expects Hash but is array by default [puppet] - 10https://gerrit.wikimedia.org/r/758894 (https://phabricator.wikimedia.org/T297411) (owner: 10Dzahn) [21:44:19] 10SRE, 10SRE-Access-Requests, 10User-Ladsgroup: Requesting access to MediaWiki deployment shell for bwang - https://phabricator.wikimedia.org/T300664 (10Dzahn) fwiw, I think we should probably copy the "approval: Tyler" line we have in data.yaml for the "restricted" (subset of deployment) also to the actual... [21:44:36] (03CR) 10Brennen Bearnes: [C: 03+1] "Yep, works." [puppet] - 10https://gerrit.wikimedia.org/r/758965 (https://phabricator.wikimedia.org/T300298) (owner: 10Ahmon Dancy) [21:50:11] (03PS1) 10Cwhite: apifeatureusage: increase logstash heap memory to 2G [puppet] - 10https://gerrit.wikimedia.org/r/758970 (https://phabricator.wikimedia.org/T297239) [21:53:26] (03CR) 10Cwhite: "https://puppet-compiler.wmflabs.org/pcc-worker1003/33539/" [puppet] - 10https://gerrit.wikimedia.org/r/758970 (https://phabricator.wikimedia.org/T297239) (owner: 10Cwhite) [21:54:18] (03PS4) 10Hashar: ci: Qemu image and snapshot creation [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) [21:55:19] (03CR) 10jerkins-bot: [V: 04-1] ci: Qemu image and snapshot creation [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [21:55:59] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet2002-dev.codfw.wmnet with OS bullseye [21:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:25] PROBLEM - SSH on mw2257.mgmt
is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:57:46] (03PS2) 10Papaul: Add ml-server200[5-6] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/758968 (https://phabricator.wikimedia.org/T294945) [21:58:20] (03Abandoned) 10Papaul: Add ml-server200[5-6] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/758968 (https://phabricator.wikimedia.org/T294945) (owner: 10Papaul) [21:58:25] (03Restored) 10Papaul: Add ml-server200[5-6] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/758968 (https://phabricator.wikimedia.org/T294945) (owner: 10Papaul) [22:01:26] (03Abandoned) 10Papaul: Add ml-server200[5-6] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/758968 (https://phabricator.wikimedia.org/T294945) (owner: 10Papaul) [22:01:36] (03Restored) 10Papaul: Add ml-server200[5-6] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/758968 (https://phabricator.wikimedia.org/T294945) (owner: 10Papaul) [22:03:24] (03PS5) 10Hashar: ci: Qemu image and snapshot creation [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) [22:04:03] (03CR) 10jerkins-bot: [V: 04-1] ci: Qemu image and snapshot creation [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [22:05:12] (03PS2) 10Cwhite: wikimedia.org: add grafana-next-rw [dns] - 10https://gerrit.wikimedia.org/r/757780 (https://phabricator.wikimedia.org/T282863) [22:05:49] (03PS3) 10Cwhite: wikimedia.org: add grafana-next-rw [dns] - 10https://gerrit.wikimedia.org/r/757780 (https://phabricator.wikimedia.org/T282863) [22:07:05] (03CR) 10Hashar: "This is intended to create a virtual image which is used to test Timo "fresh" project and the mw-cli utility." 
[puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [22:09:18] (03CR) 10Cwhite: "Someone from traffic need to look at this as well?" [puppet] - 10https://gerrit.wikimedia.org/r/757778 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [22:09:43] (03CR) 10Ryan Kemper: [C: 03+2] elastic: install elasticsearch-oss from component [puppet] - 10https://gerrit.wikimedia.org/r/757700 (owner: 10Ryan Kemper) [22:10:50] (03CR) 10Cwhite: [C: 03+1] sre: check agent resources too in PuppetFailure [alerts] - 10https://gerrit.wikimedia.org/r/758860 (https://phabricator.wikimedia.org/T299628) (owner: 10Filippo Giunchedi) [22:11:26] 10SRE, 10SRE-Access-Requests, 10Analytics, 10Patch-For-Review, 10User-Ladsgroup: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (10DannyH) I'm not sure how to check this. On Superset, my profile is https://superset.wikimedia.org/superset/profile/dannyh/ To log in, I... [22:12:09] (03CR) 10Dave Pifke: [C: 03+1] "Thanks for automating this! Overall, looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [22:13:13] (03PS1) 10Ryan Kemper: Revert "elastic: install elasticsearch-oss from component" [puppet] - 10https://gerrit.wikimedia.org/r/758908 [22:13:21] (03CR) 10Ryan Kemper: "Notice: /Stage[main]/Elasticsearch::Curator/Apt::Package_from_component[elasticsearch-curator]/Exec[exec_apt_elasticsearch-curator]/return" [puppet] - 10https://gerrit.wikimedia.org/r/758908 (owner: 10Ryan Kemper) [22:13:53] (03PS3) 10Dzahn: Add ml-server200[5-6] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/758968 (https://phabricator.wikimedia.org/T294945) (owner: 10Papaul) [22:14:47] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (7) node(s) change every puppet run: miscweb1002, orespoolcounter2003, orespoolcounter2004, an-test-client1001, orespoolcounter1004, cloudmetrics1004, orespoolcounter1003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [22:15:09] (03CR) 10Papaul: [C: 03+2] Add ml-server200[5-6] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/758968 (https://phabricator.wikimedia.org/T294945) (owner: 10Papaul) [22:15:36] (03CR) 10Ryan Kemper: "Puppet failure with the elasticsearch-curator stuff: https://phabricator.wikimedia.org/P19885" [puppet] - 10https://gerrit.wikimedia.org/r/757700 (owner: 10Ryan Kemper) [22:16:06] (03CR) 10Ryan Kemper: [C: 03+2] Revert "elastic: install elasticsearch-oss from component" [puppet] - 10https://gerrit.wikimedia.org/r/758908 (owner: 10Ryan Kemper) [22:18:31] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01189 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [22:19:21] "widespread" made me look on miscweb1002 [22:19:30] but it's not broken (now) [22:21:05] !log pt1979@cumin2002 START - Cookbook 
sre.hosts.reimage for host ml-serve2005.codfw.wmnet with OS buster [22:21:05] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host ml-serve2005.codfw.wmnet with OS buster [22:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:15] (03PS1) 10Ryan Kemper: Revert "Revert "elastic: install elasticsearch-oss from component"" [puppet] - 10https://gerrit.wikimedia.org/r/758909 [22:22:00] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ml-serve2005.codfw.wmnet with OS buster [22:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:11] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-serve2005.codfw.wmnet with OS b... [22:24:05] (03PS2) 10Ryan Kemper: elastic: install elasticsearch-oss from component [puppet] - 10https://gerrit.wikimedia.org/r/758909 [22:24:42] 10SRE, 10SRE-Access-Requests, 10Release-Engineering-Team (Radar), 10User-Ladsgroup, 10User-brennen: Requesting access to deploy-phabricator for brennen - https://phabricator.wikimedia.org/T300658 (10Dzahn) >>! In T300658#7668604, @thcipriani wrote: > There is no official approver in data.yaml; however, I... 
[22:24:54] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/758909 (owner: 10Ryan Kemper) [22:25:13] (03Abandoned) 10Ryan Kemper: elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753983 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking) [22:26:16] (03PS3) 10Ryan Kemper: elastic: install elasticsearch-oss from component [puppet] - 10https://gerrit.wikimedia.org/r/758909 [22:30:21] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/33540/gitlab1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/758894 (https://phabricator.wikimedia.org/T297411) (owner: 10Dzahn) [22:32:10] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33542/console" [puppet] - 10https://gerrit.wikimedia.org/r/758909 (owner: 10Ryan Kemper) [22:33:45] (03PS4) 10Ryan Kemper: elastic: install elasticsearch-oss from component [puppet] - 10https://gerrit.wikimedia.org/r/758909 [22:34:24] (03CR) 10Bking: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33543/console" [puppet] - 10https://gerrit.wikimedia.org/r/758909 (owner: 10Ryan Kemper) [22:34:41] (03CR) 10Dzahn: "the "role::gitlab" class applied on gitlab-prod-1001.devtools now DOES NOT FAIL anymore. as in "puppet agent finishes". 
we have some more " [puppet] - 10https://gerrit.wikimedia.org/r/758894 (https://phabricator.wikimedia.org/T297411) (owner: 10Dzahn) [22:36:11] (03CR) 10Ryan Kemper: [C: 03+2] elastic: install elasticsearch-oss from component [puppet] - 10https://gerrit.wikimedia.org/r/758909 (owner: 10Ryan Kemper) [22:37:05] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/758973 [22:42:54] (03CR) 10Andrew Bogott: "Asked on IRC as well: Is this meant to be a transitional feature, supporting the in-between state after you add dns-handling to the API e" [puppet] - 10https://gerrit.wikimedia.org/r/756117 (https://phabricator.wikimedia.org/T295246) (owner: 10Majavah) [22:48:11] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet2002-dev.codfw.wmnet with OS bullseye [22:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2005.codfw.wmnet with OS buster [22:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:54] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-serve2005.codfw.wmnet with OS buste... 
[22:55:38] (03PS3) 10Ebernhardson: rdf-streaming-updater: add the reconciliation stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753788 (https://phabricator.wikimedia.org/T279541) (owner: 10DCausse) [23:09:37] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.001622 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [23:29:18] (03PS4) 10Zabe: Start writing to some wmg* constants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734572 (https://phabricator.wikimedia.org/T45956) [23:32:53] (03PS2) 10Zabe: Migrate calls of wmf* constants to wmg* constants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734573 (https://phabricator.wikimedia.org/T45956) [23:37:51] (03CR) 10Ahmon Dancy: [C: 04-1] ci: Qemu image and snapshot creation (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [23:37:55] (03PS2) 10Zabe: Consistently write to $wmgRealm the same value as to $wmfRealm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734582 (https://phabricator.wikimedia.org/T45956) [23:41:38] (03PS1) 10Ottomata: Actually unset env vars that are activated by conda/activate.d/env_vars.sh [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/758983 (https://phabricator.wikimedia.org/T292699) [23:43:58] (03PS2) 10Ottomata: Actually unset env vars that are activated by conda/activate.d/env_vars.sh [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/758983 (https://phabricator.wikimedia.org/T292699) [23:52:52] (03PS1) 10Dduvall: aptrepo: add docker packages to thirdparty/ci for buster [puppet] - 10https://gerrit.wikimedia.org/r/758986 (https://phabricator.wikimedia.org/T300682) [23:52:54] (03PS1) 10Dduvall: contint: Install docker 20.10 from thirdparty/ci on buster [puppet] - 10https://gerrit.wikimedia.org/r/758987 (https://phabricator.wikimedia.org/T300682) [23:53:09] PROBLEM 
- Disk space on centrallog1001 is CRITICAL: DISK CRITICAL - free space: /srv 34100 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=centrallog1001&var-datasource=eqiad+prometheus/ops [23:53:48] (03CR) 10jerkins-bot: [V: 04-1] contint: Install docker 20.10 from thirdparty/ci on buster [puppet] - 10https://gerrit.wikimedia.org/r/758987 (https://phabricator.wikimedia.org/T300682) (owner: 10Dduvall) [23:56:04] (03CR) 10Dduvall: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/758987 (https://phabricator.wikimedia.org/T300682) (owner: 10Dduvall) [23:56:51] (03CR) 10jerkins-bot: [V: 04-1] contint: Install docker 20.10 from thirdparty/ci on buster [puppet] - 10https://gerrit.wikimedia.org/r/758987 (https://phabricator.wikimedia.org/T300682) (owner: 10Dduvall) [23:57:54] (03PS2) 10Dduvall: contint: Install docker 20.10 from thirdparty/ci on buster [puppet] - 10https://gerrit.wikimedia.org/r/758987 (https://phabricator.wikimedia.org/T300682) [23:59:29] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook