[00:08:59] !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.83`. Pre-deploy tests passing on canary `wdqs1003` [00:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:09:27] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@da9efa9]: 0.3.83 [00:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:08] !log [WDQS Deploy] Tests passing following deploy of `0.3.83` on canary `wdqs1003`; proceeding to rest of fleet [00:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:31] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@da9efa9]: 0.3.83 (duration: 07m 05s) [00:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:16] !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [00:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:19] !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'` [00:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:24] !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'` [00:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:20:52] (03CR) 10Krinkle: [C: 03+1] debug.json: List primary DC servers first [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714418 (https://phabricator.wikimedia.org/T289246) (owner: 10Legoktm) [00:42:58] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:01:44] (03CR) 10Bstorm: 
toolforge: remove portgrabber (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714187 (owner: 10Majavah) [02:00:04] Deploy window Branching MediaWiki, extensions, skins, and vendor – See Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210824T0200) [02:06:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [02:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:06:46] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.37.0-wmf.20 [core] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714446 [02:06:48] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.37.0-wmf.20 [core] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714446 (owner: 10TrainBranchBot) [02:08:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [02:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:27:56] (03CR) 10jerkins-bot: [V: 04-1] Branch commit for wmf/1.37.0-wmf.20 [core] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714446 (owner: 10TrainBranchBot) [03:05:30] 10SRE, 10WM-Bot: wm-bot doesn't set charset=utf-8, which breaks (amongst other things) emoji rendering - https://phabricator.wikimedia.org/T250104 (10MacFan4000) [03:25:57] PROBLEM - LVS wdqs eqiad port 80/tcp - Wikidata Query Service IPv4 #page on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:27:43] RECOVERY - LVS wdqs eqiad port 80/tcp - Wikidata Query Service IPv4 #page on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.016 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:29:00] here, looking [03:46:36] !log wdqs1012 restarted prometheus-blazegraph-exporter-wdqs-blazegraph.service and 
prometheus-blazegraph-exporter-wdqs-categories.service after apparent exceptions/crashes [03:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:51:50] !log rzl@wdqs1012:~$ sudo depool [03:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:55:02] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:55:22] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:18:57] 10SRE, 10LDAP-Access-Requests: LDAP Access to wmf user group for TAndic - https://phabricator.wikimedia.org/T288527 (10Dzahn) We received automatic email telling us that " tandic is in nda group, but registered with a WMF account". @TAndic I removed you from "nda" but added you to "wmf". This should make no... [05:49:53] (03CR) 10Dzahn: [C: 03+1] "looks and compiles fine: https://puppet-compiler.wmflabs.org/compiler1001/30790/" [puppet] - 10https://gerrit.wikimedia.org/r/714386 (owner: 10Effie Mouzeli) [05:54:43] (03CR) 10Dzahn: "please also see the comments on https://gerrit.wikimedia.org/r/c/operations/puppet/+/377721 can we ensure there are not going to be tilde " [puppet] - 10https://gerrit.wikimedia.org/r/714414 (owner: 10Ahmon Dancy) [05:55:55] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mc2037.codfw.wmnet'... [05:56:46] (03CR) 10Dzahn: "The reason to do this is to avoid getting the "you are using minikube" message after deploying." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/714034 (https://phabricator.wikimedia.org/T255148) (owner: 10Dzahn) [06:14:58] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2037.codfw.wmnet with reason: REIMAGE [06:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:14] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2037.codfw.wmnet with reason: REIMAGE [06:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:23] 10SRE, 10Analytics, 10Patch-For-Review: Trash cleanup cron spams on an-test hosts - https://phabricator.wikimedia.org/T286442 (10elukey) @BTullis we have been doing it manually for the stat100x boxes so far, nothing on puppet! [06:28:08] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2037.codfw.wmnet'] ` and were **ALL** successful. 
[06:30:11] (03PS1) 10Marostegui: mariadb: Update version to 10.4.21 [software] - 10https://gerrit.wikimedia.org/r/714454 [06:31:15] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10jijiki) [06:31:26] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10jijiki) [06:32:18] (03PS2) 10Filippo Giunchedi: o11y: port thanos-compact alerts from Icinga [alerts] - 10https://gerrit.wikimedia.org/r/714372 (https://phabricator.wikimedia.org/T288726) [06:32:20] (03CR) 10Filippo Giunchedi: o11y: port thanos-compact alerts from Icinga (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/714372 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [06:32:23] (03CR) 10Marostegui: [C: 03+2] mariadb: Update version to 10.4.21 [software] - 10https://gerrit.wikimedia.org/r/714454 (owner: 10Marostegui) [06:32:54] (03Merged) 10jenkins-bot: mariadb: Update version to 10.4.21 [software] - 10https://gerrit.wikimedia.org/r/714454 (owner: 10Marostegui) [06:35:14] (03CR) 10Filippo Giunchedi: [C: 03+2] o11y: port thanos-compact alerts from Icinga [alerts] - 10https://gerrit.wikimedia.org/r/714372 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [06:35:19] (03PS3) 10Filippo Giunchedi: o11y: port thanos-compact alerts from Icinga [alerts] - 10https://gerrit.wikimedia.org/r/714372 (https://phabricator.wikimedia.org/T288726) [06:35:31] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: remove thanos-compact alerts, ported to alerts.git [puppet] - 10https://gerrit.wikimedia.org/r/714373 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [06:40:25] (03CR) 10Ayounsi: [C: 03+1] reports/puppetdb: support WMF standard configs [software/netbox-extras] - 
10https://gerrit.wikimedia.org/r/714377 (https://phabricator.wikimedia.org/T284614) (owner: 10Volans) [06:48:41] (03CR) 10Elukey: [C: 03+1] "Post-merge check :)" [puppet] - 10https://gerrit.wikimedia.org/r/714331 (https://phabricator.wikimedia.org/T276239) (owner: 10Btullis) [06:51:09] (03CR) 10Elukey: "LGTM - just to confirm, have the namenodes been restarted?" [puppet] - 10https://gerrit.wikimedia.org/r/714369 (https://phabricator.wikimedia.org/T275767) (owner: 10Btullis) [06:55:05] (03CR) 10Volans: "inline reply" [puppet] - 10https://gerrit.wikimedia.org/r/714137 (https://phabricator.wikimedia.org/T289202) (owner: 10RLazarus) [06:55:38] (03CR) 10Volans: [C: 03+2] reports/puppetdb: support WMF standard configs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/714377 (https://phabricator.wikimedia.org/T284614) (owner: 10Volans) [06:56:23] (03Merged) 10jenkins-bot: reports/puppetdb: support WMF standard configs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/714377 (https://phabricator.wikimedia.org/T284614) (owner: 10Volans) [06:57:36] (03CR) 10Dzahn: osm: migrate cron osm_sync_lag to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713087 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [06:58:05] (03CR) 10Dzahn: "this is used only in cloud, it needs someone involved in the relevant cloud project to deploy this" [puppet] - 10https://gerrit.wikimedia.org/r/713087 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [06:59:13] (03CR) 10Dzahn: "ah, nevermind, of course osm_master is used in prod, but let's add some people working on maps*" [puppet] - 10https://gerrit.wikimedia.org/r/713087 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [07:01:02] (03CR) 10Filippo Giunchedi: logstash: route alertmanager alerts to logstash alerts index (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714374 (https://phabricator.wikimedia.org/T289356) (owner: 10Herron) [07:01:51] (03CR) 10JMeybohm: [C: 
03+1] miscweb: set service.deployment to production, not minikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/714034 (https://phabricator.wikimedia.org/T255148) (owner: 10Dzahn) [07:06:28] 10SRE, 10Analytics-Clusters, 10Analytics-Radar, 10SRE Observability (FY2021/2022-Q1): Move kafkamon hosts to Debian Buster - https://phabricator.wikimedia.org/T252773 (10Volans) Thanks for the fix. [07:06:34] (03CR) 10Dzahn: [C: 03+2] miscweb: set service.deployment to production, not minikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/714034 (https://phabricator.wikimedia.org/T255148) (owner: 10Dzahn) [07:06:44] (03CR) 10jerkins-bot: [V: 04-1] miscweb: set service.deployment to production, not minikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/714034 (https://phabricator.wikimedia.org/T255148) (owner: 10Dzahn) [07:08:12] (03PS4) 10Dzahn: miscweb: set service.deployment to production, not minikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/714034 (https://phabricator.wikimedia.org/T255148) [07:10:43] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Spicerack downtime methods fail when the admin reason includes an apostrophe - https://phabricator.wikimedia.org/T288558 (10Volans) a:05RLazarus→03Volans Ack, I'll make a new release in the next few days, claiming the task. [07:13:01] (03CR) 10Jelto: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/714034 (https://phabricator.wikimedia.org/T255148) (owner: 10Dzahn) [07:15:34] !log Optimize huwiki.flaggedtemplates on db1098:3317 [07:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:52] (03CR) 10JMeybohm: "Currently running another PCC on `P:tlsproxy::envoy` just to be sure - as there are so many roles using this." 
[puppet] - 10https://gerrit.wikimedia.org/r/710507 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [07:17:26] !log Optimize huwiki.flaggedtemplates on db1127 [07:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:11] (03CR) 10Dzahn: [C: 03+2] miscweb: set service.deployment to production, not minikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/714034 (https://phabricator.wikimedia.org/T255148) (owner: 10Dzahn) [07:19:15] (03PS5) 10Dzahn: miscweb: set service.deployment to production, not minikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/714034 (https://phabricator.wikimedia.org/T255148) [07:20:09] (03PS3) 10JMeybohm: k8s::apiserver: Add admission controller config file [puppet] - 10https://gerrit.wikimedia.org/r/714071 (https://phabricator.wikimedia.org/T289131) [07:20:51] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30795/console" [puppet] - 10https://gerrit.wikimedia.org/r/714071 (https://phabricator.wikimedia.org/T289131) (owner: 10JMeybohm) [07:26:59] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10puppet-compiler, and 2 others: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 (10Ladsgroup) [07:27:02] 10SRE, 10serviceops, 10Patch-For-Review: Create a mediawiki::cronjob define - https://phabricator.wikimedia.org/T211250 (10Ladsgroup) [07:27:26] 10Puppet, 10Infrastructure-Foundations, 10Wikidata, 10wdwb-tech, 10User-Ladsgroup: Migrate wikibase-dispatch-changes crons to systemd timers - https://phabricator.wikimedia.org/T288175 (10Ladsgroup) 05Open→03Resolved It's done \o/ [07:29:34] !log restarting blazegraph on wdqs1012 [07:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:54] (03PS1) 10Dzahn: replace separate httpd configs for stating/test with links to prod [container/miscweb] - 10https://gerrit.wikimedia.org/r/714458 
[07:30:53] (03PS2) 10Dzahn: replace separate httpd configs for stating/test with links to prod [container/miscweb] - 10https://gerrit.wikimedia.org/r/714458 (https://phabricator.wikimedia.org/T255148) [07:32:21] (03PS1) 10Dzahn: static-bugzilla: uncomment rewrite config line [container/miscweb] - 10https://gerrit.wikimedia.org/r/714459 (https://phabricator.wikimedia.org/T255148) [07:38:29] (03CR) 10David Caro: "This can probably use wmcs-backups and the libs there instead of running bare commands, that gives some control on handling the backups th" [puppet] - 10https://gerrit.wikimedia.org/r/714428 (owner: 10Andrew Bogott) [07:43:47] (03CR) 10Dzahn: [C: 03+2] replace separate httpd configs for stating/test with links to prod [container/miscweb] - 10https://gerrit.wikimedia.org/r/714458 (https://phabricator.wikimedia.org/T255148) (owner: 10Dzahn) [07:43:53] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30796/console" [puppet] - 10https://gerrit.wikimedia.org/r/714071 (https://phabricator.wikimedia.org/T289131) (owner: 10JMeybohm) [07:44:48] (03Merged) 10jenkins-bot: replace separate httpd configs for stating/test with links to prod [container/miscweb] - 10https://gerrit.wikimedia.org/r/714458 (https://phabricator.wikimedia.org/T255148) (owner: 10Dzahn) [07:46:17] (03PS2) 10Dzahn: static-bugzilla: uncomment rewrite config line [container/miscweb] - 10https://gerrit.wikimedia.org/r/714459 (https://phabricator.wikimedia.org/T255148) [07:47:10] (03PS4) 10JMeybohm: k8s::apiserver: Add admission controller config file [puppet] - 10https://gerrit.wikimedia.org/r/714071 (https://phabricator.wikimedia.org/T289131) [07:47:12] (03CR) 10Dzahn: [C: 03+2] static-bugzilla: uncomment rewrite config line [container/miscweb] - 10https://gerrit.wikimedia.org/r/714459 (https://phabricator.wikimedia.org/T255148) (owner: 10Dzahn) [07:47:47] 10SRE, 10SRE-Access-Requests, 10Gerrit, 10LDAP-Access-Requests: Add 
dancy to `gerritadmin` LDAP group - https://phabricator.wikimedia.org/T289537 (10jcrespo) a:03jcrespo [07:47:56] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30797/console" [puppet] - 10https://gerrit.wikimedia.org/r/714071 (https://phabricator.wikimedia.org/T289131) (owner: 10JMeybohm) [07:48:31] (03Merged) 10jenkins-bot: static-bugzilla: uncomment rewrite config line [container/miscweb] - 10https://gerrit.wikimedia.org/r/714459 (https://phabricator.wikimedia.org/T255148) (owner: 10Dzahn) [07:51:35] (03CR) 10Dzahn: [C: 03+2] miscweb: set service.deployment to production, not minikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/714034 (https://phabricator.wikimedia.org/T255148) (owner: 10Dzahn) [07:51:47] !log repool wdqs1012 T289551 [07:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:52] T289551: wdqs1012 flatlined after page for wdqs.svc.eqiad.wmnet timing out - https://phabricator.wikimedia.org/T289551 [07:54:02] (03Merged) 10jenkins-bot: miscweb: set service.deployment to production, not minikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/714034 (https://phabricator.wikimedia.org/T255148) (owner: 10Dzahn) [07:58:11] 10SRE, 10SRE-Access-Requests, 10Gerrit, 10LDAP-Access-Requests: Add dancy to `gerritadmin` LDAP group - https://phabricator.wikimedia.org/T289537 (10jcrespo) [08:00:10] 10SRE, 10SRE-Access-Requests, 10Gerrit, 10LDAP-Access-Requests: Add dancy to `gerritadmin` LDAP group - https://phabricator.wikimedia.org/T289537 (10jcrespo) p:05Triage→03High [08:01:06] !log temp fix thanos-swift.discovery.wmnet in /etc/hosts to get swift-dispersion-stats to work - T283714 [08:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:11] T283714: Python 3's eventlet.green getaddrinfo timeout in Bullseye - https://phabricator.wikimedia.org/T283714 [08:03:45] (03PS1) 10Dzahn: static-bugzilla: 
add uncompressed HTML for the first 1000 bugs [container/miscweb] - 10https://gerrit.wikimedia.org/r/714460 (https://phabricator.wikimedia.org/T255148) [08:06:00] 10SRE, 10SRE-Access-Requests, 10Gerrit, 10LDAP-Access-Requests: Add dancy to `gerritadmin` LDAP group - https://phabricator.wikimedia.org/T289537 (10jcrespo) 05Open→03Resolved This is done on LDAP https://ldap.toolforge.org/user/dancy Please note that we have in our notes "The ldap_groups Gerrit cache... [08:06:30] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:08:33] (03PS2) 10Dzahn: static-bugzilla: add uncompressed HTML for the first 1000 bugs [container/miscweb] - 10https://gerrit.wikimedia.org/r/714460 (https://phabricator.wikimedia.org/T255148) [08:12:54] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Nathan Forrester - https://phabricator.wikimedia.org/T289259 (10jcrespo) Hi, @odimitrijevic, this is a friendly reminder that this request (and other 2) are pending on your approval. Please t... [08:14:31] (03CR) 10Elukey: "Trying to add some comments in here to speed up the review of the patch, sorry again for the late answers. This codebase is not touched of" [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/708094 (owner: 10R4q3NWnUx2CEhVyr) [08:19:35] (03CR) 10Jelto: "I tested the change on the replica gitlab2001 and login using still works." 
[gitlab-ansible] - 10https://gerrit.wikimedia.org/r/714382 (https://phabricator.wikimedia.org/T288392) (owner: 10Jbond) [08:22:47] (03PS5) 10JMeybohm: k8s::apiserver: Add admission controller config file [puppet] - 10https://gerrit.wikimedia.org/r/714071 (https://phabricator.wikimedia.org/T289131) [08:23:27] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30798/console" [puppet] - 10https://gerrit.wikimedia.org/r/714071 (https://phabricator.wikimedia.org/T289131) (owner: 10JMeybohm) [08:26:48] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] debug.json: List primary DC servers first [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714418 (https://phabricator.wikimedia.org/T289246) (owner: 10Legoktm) [08:28:50] (03PS6) 10JMeybohm: k8s::apiserver: Add admission controller config file [puppet] - 10https://gerrit.wikimedia.org/r/714071 (https://phabricator.wikimedia.org/T289131) [08:29:43] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30799/console" [puppet] - 10https://gerrit.wikimedia.org/r/714071 (https://phabricator.wikimedia.org/T289131) (owner: 10JMeybohm) [08:39:16] mutante: o/ I see some gerrit notifications in T255148, not sure if it is the right task or not [08:39:16] T255148: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 [08:39:21] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata/hosts.pp: eqiad memcached refresh cleanup [puppet] - 10https://gerrit.wikimedia.org/r/714050 (https://phabricator.wikimedia.org/T278225) (owner: 10Effie Mouzeli) [08:43:28] elukey: oooh, no that is not the right task. that was an accident. sorry and thanks for pointing it out [08:43:46] mutante: np! 
I am subscribed to it and I was confused, this is why I asked :) [08:44:31] (03PS3) 10Dzahn: static-bugzilla: add uncompressed HTML for the first 1000 bugs [container/miscweb] - 10https://gerrit.wikimedia.org/r/714460 (https://phabricator.wikimedia.org/T281538) [08:48:11] elukey: cant fix the commit messages because they are already merged and can't delete the phab comments because it asks for 2fa and says my code is invalid :p [08:49:18] oh well.. no more new ones at least now [08:49:27] be back later [08:50:14] nono it is fine for the prev comments, no problem :) [08:51:26] 10SRE, 10Infrastructure-Foundations, 10netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929 (10ayounsi) @cmooney That looks cleaner indeed. @faidon We're moving away from 172.16/12 IPs being able to reach the Wikis, which means VM traffic needs to be NATed and looses useful troubleshooting... [08:51:53] !log start of mwscript extensions/FlaggedRevs/maintenance/pruneRevData.php --wiki=arwiki --prune --batch-size=5 --sleep=5 (T289249) [08:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:58] T289249: flaggedtemplates table should not keep the whole history of all revisions - https://phabricator.wikimedia.org/T289249 [08:54:00] (03CR) 10Jbond: [C: 04-1] gitlab cas: update uid field to use uid not CN (031 comment) [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/714382 (https://phabricator.wikimedia.org/T288392) (owner: 10Jbond) [08:54:20] !log start of mwscript extensions/FlaggedRevs/maintenance/pruneRevData.php --wiki=dewiki --prune --batch-size=5 --sleep=5 (T289249) [08:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:56] (03CR) 10Btullis: Add six worker nodes that were staged into service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714369 (https://phabricator.wikimedia.org/T275767) (owner: 10Btullis) [08:56:50] (03PS1) 10Elukey: kubeflow-kfserving: add missing comma 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/714534 (https://phabricator.wikimedia.org/T272919) [08:57:29] (03CR) 10Elukey: [C: 03+1] Add six worker nodes that were staged into service [puppet] - 10https://gerrit.wikimedia.org/r/714369 (https://phabricator.wikimedia.org/T275767) (owner: 10Btullis) [08:57:53] (03CR) 10JMeybohm: envoyproxy: Support V3 configuration API (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/710495 (https://phabricator.wikimedia.org/T265880) (owner: 10Vgutierrez) [08:58:18] (03CR) 10Jbond: [C: 03+2] Removed verify=False satement from requests session constructor that had been present during initial testing. [software/statograph] - 10https://gerrit.wikimedia.org/r/709028 (owner: 10Cathal Mooney) [08:58:47] 10SRE, 10Traffic, 10serviceops: Unexpected upload speed to commons - https://phabricator.wikimedia.org/T288481 (10jcrespo) Hey, @aborrero, I cannot speak on behalf of the traffic/netops/serviceops team, but given that large files -IIRC- have a different workflow than smaller ones (multi-part upload) and spe... [09:00:16] (03CR) 10Elukey: [C: 03+2] kubeflow-kfserving: add missing comma [deployment-charts] - 10https://gerrit.wikimedia.org/r/714534 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [09:01:00] 10SRE, 10Traffic, 10serviceops, 10Wikimedia-production-error: Unexpected upload speed to commons - https://phabricator.wikimedia.org/T288481 (10Majavah) [09:01:09] (03PS1) 10Jbond: 0.1.1: prepare release [software/statograph] - 10https://gerrit.wikimedia.org/r/714535 [09:02:27] (03CR) 10Jbond: [C: 03+2] 0.1.1: prepare release [software/statograph] - 10https://gerrit.wikimedia.org/r/714535 (owner: 10Jbond) [09:02:32] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [09:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:36] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[09:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:43] (03CR) 10Effie Mouzeli: [C: 03+2] mcrouter: add option to remove certificatre expiring alerts [puppet] - 10https://gerrit.wikimedia.org/r/714386 (owner: 10Effie Mouzeli) [09:03:56] (03CR) 10Effie Mouzeli: [C: 03+2] "Good for mwmaint2002 too https://puppet-compiler.wmflabs.org/compiler1003/30792/" [puppet] - 10https://gerrit.wikimedia.org/r/714386 (owner: 10Effie Mouzeli) [09:04:06] (03CR) 10JMeybohm: [C: 03+1] "This LGTM but I think it would be nice to initially disable puppet (at least on mw*) and double check on some hosts that this does not bre" [puppet] - 10https://gerrit.wikimedia.org/r/710507 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [09:06:41] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:08:03] !log upload new statograph version [09:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:25] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10Epic, 10Goal: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) [09:13:59] 10SRE, 10Data-Persistence-Backup, 10Goal, 10Patch-For-Review: Puppetize media backups infrastructure - https://phabricator.wikimedia.org/T276442 (10jcrespo) 05Open→03Resolved I am going to consider this as "done", but obviously, further changes will be required in the future- but the basic services are... 
[09:32:29] (03PS1) 10Filippo Giunchedi: profile: remove thanos alerts, moved to alerts.git [puppet] - 10https://gerrit.wikimedia.org/r/714541 (https://phabricator.wikimedia.org/T288726) [09:32:47] (03CR) 10jerkins-bot: [V: 04-1] profile: remove thanos alerts, moved to alerts.git [puppet] - 10https://gerrit.wikimedia.org/r/714541 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [09:33:00] (03PS1) 10Filippo Giunchedi: o11y: add alerts ported from icinga/upstream [alerts] - 10https://gerrit.wikimedia.org/r/714543 (https://phabricator.wikimedia.org/T288726) [09:33:40] (03PS2) 10Filippo Giunchedi: profile: remove thanos alerts, moved to alerts.git [puppet] - 10https://gerrit.wikimedia.org/r/714541 (https://phabricator.wikimedia.org/T288726) [09:35:28] (03PS2) 10Btullis: Add six worker nodes that were staged into service [puppet] - 10https://gerrit.wikimedia.org/r/714369 (https://phabricator.wikimedia.org/T275767) [09:37:28] (03CR) 10Btullis: [C: 03+2] Add six worker nodes that were staged into service [puppet] - 10https://gerrit.wikimedia.org/r/714369 (https://phabricator.wikimedia.org/T275767) (owner: 10Btullis) [09:46:29] (03CR) 10Filippo Giunchedi: "It occurred to me that we should be adding comments/disclaimers to this function, specifically on the lifecycle of resources in puppetdb. 
" [puppet] - 10https://gerrit.wikimedia.org/r/692286 (https://phabricator.wikimedia.org/T282880) (owner: 10Jbond) [09:58:23] (03CR) 10Marostegui: [C: 03+1] bernard: Add basic tox config [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) (owner: 10H.krishna123) [09:58:59] (03PS1) 10Kosta Harlan: GrowthExperiments: Switch image recommendations flag off [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714548 (https://phabricator.wikimedia.org/T288797) [09:59:18] (03PS2) 10Kosta Harlan: GrowthExperiments: Switch image recommendations flag off [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714548 (https://phabricator.wikimedia.org/T288797) [10:00:00] (03PS1) 10Kosta Harlan: [labs] GrowthExperiments: Switch image recommendations flag on [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714549 [10:02:10] (03CR) 10JMeybohm: [C: 04-1] envoyproxy: Support ciphersuite configuration (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/710577 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [10:07:51] (03CR) 10JMeybohm: "This can/should be merged with Id5e492332de6990eb4a7f40e89755165550d276f I guess" [puppet] - 10https://gerrit.wikimedia.org/r/710581 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [10:12:12] (03CR) 10JMeybohm: [C: 04-1] envoyproxy: Add STEK configuration support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711399 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [10:27:18] (03CR) 10Vgutierrez: envoyproxy: Support ciphersuite configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/710577 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [10:27:36] (03PS1) 10Jbond: puppet-facts-export: ensure we have facts before continuing to process [puppet] - 10https://gerrit.wikimedia.org/r/714551 (https://phabricator.wikimedia.org/T289335) [10:27:57] 10Puppet, 10Infrastructure-Foundations, 
10puppet-compiler, 10Patch-For-Review: puppet-facts-export sometimes fails with 'trusted' fact not found - https://phabricator.wikimedia.org/T289335 (10jbond) a:03jbond I had a quick look at the code and there is a potential race condition. We first get a list of... [10:28:37] (03CR) 10Jbond: [C: 03+2] puppet-facts-export: ensure we have facts before continuing to process [puppet] - 10https://gerrit.wikimedia.org/r/714551 (https://phabricator.wikimedia.org/T289335) (owner: 10Jbond) [10:36:20] (03PS8) 10Vgutierrez: envoyproxy: Support ciphersuite configuration [puppet] - 10https://gerrit.wikimedia.org/r/710577 (https://phabricator.wikimedia.org/T271421) [10:36:48] (03CR) 10Vgutierrez: envoyproxy: Support ciphersuite configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/710577 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [10:44:48] (03CR) 10Vgutierrez: varnish: Containerize varnish test environment (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: 10MMandere) [10:52:26] (03CR) 10Jcrespo: [C: 03+2] bernard: Add basic tox config [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) (owner: 10H.krishna123) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Time to snap out of that daydream and deploy European mid-day backport window. Your patch may or may not be deployed at the sole discretion of the deployer. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210824T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. [11:00:16] \o/ [11:05:12] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mc2029.codfw.wmnet'... 
[11:19:38] (03CR) 10Jelto: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/714038 (https://phabricator.wikimedia.org/T289131) (owner: 10JMeybohm) [11:21:21] 10SRE, 10CommRel-Specialists-Support (Jul-Sep-2021), 10Datacenter-Switchover: CommRel support for September 2021 Switchover - https://phabricator.wikimedia.org/T287546 (10Elitre) @Trizek-WMF @Quiddity could you please verify whether the "two weeks before" tasks have been completed? If so, please mark them do... [11:24:45] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2029.codfw.wmnet with reason: REIMAGE [11:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:03] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mc2029.codfw.wmnet with reason: REIMAGE [11:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:49] PROBLEM - Memcached on mc2029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Memcached [11:35:25] PROBLEM - Host mc2029 is DOWN: PING CRITICAL - Packet loss = 100% [11:35:59] RECOVERY - Memcached on mc2029 is OK: TCP OK - 3.070 second response time on 10.192.32.161 port 11214 https://wikitech.wikimedia.org/wiki/Memcached [11:36:01] RECOVERY - Host mc2029 is UP: PING OK - Packet loss = 0%, RTA = 33.10 ms [11:37:16] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2029.codfw.wmnet'] ` and were **ALL** successful. 
[11:50:15] (03CR) 10Zfilipin: [C: 03+1] selenium: Update README.md file [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/713217 (https://phabricator.wikimedia.org/T282237) (owner: 10Sahilgrewalhere) [12:08:31] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:08:45] 10SRE, 10Wikidata, 10Wikidata Query Builder, 10wdwb-tech, and 4 others: Deploy query builder to microsites (on top of the wdqs-ui) - https://phabricator.wikimedia.org/T266703 (10Addshore) 05Open→03Resolved [12:21:30] (03PS18) 10Jbond: wmflib::role_hosts: new function return list of hosts running a role [puppet] - 10https://gerrit.wikimedia.org/r/692286 (https://phabricator.wikimedia.org/T282880) [12:26:11] (03PS1) 10Marostegui: dbproxy1019: Depool clouddb1015. [puppet] - 10https://gerrit.wikimedia.org/r/714562 [12:28:59] (03CR) 10Marostegui: [C: 03+2] dbproxy1019: Depool clouddb1015. [puppet] - 10https://gerrit.wikimedia.org/r/714562 (owner: 10Marostegui) [12:30:13] !log Install 10.4.21 on clouddb1015 [12:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:32] !log test patched python3-eventlet on thanos-fe1003 - T283714 [12:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:37] T283714: Python 3's eventlet.green getaddrinfo timeout in Bullseye - https://phabricator.wikimedia.org/T283714 [12:36:12] !log disable puppet on P:tlsproxy::envoy hosts - merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/710507/9 [12:36:33] 10SRE, 10SRE-swift-storage, 10User-fgiunchedi: Python 3's eventlet.green getaddrinfo timeout in Bullseye - https://phabricator.wikimedia.org/T283714 (10fgiunchedi) I was able to get a working python3-eventlet package by integrating [[ https://github.com/eventlet/eventlet/pull/722/ | upstream PR ]], the easy... 
[12:37:08] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] envoyproxy: Add dual stack cert support [puppet] - 10https://gerrit.wikimedia.org/r/710507 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [12:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:44] (03PS1) 10Marostegui: Revert "dbproxy1019: Depool clouddb1015." [puppet] - 10https://gerrit.wikimedia.org/r/714163 [12:38:37] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1019: Depool clouddb1015." [puppet] - 10https://gerrit.wikimedia.org/r/714163 (owner: 10Marostegui) [12:45:25] !log enable puppet on P:tlsproxy::envoy hosts - merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/710507/9 [12:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:07] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/692286 (https://phabricator.wikimedia.org/T282880) (owner: 10Jbond) [12:47:05] (03PS2) 10Zabe: osm: migrate cron osm_sync_lag to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/713087 (https://phabricator.wikimedia.org/T273673) [12:47:20] (03PS5) 10MVernon: prometheus: couple mysqld exporter service to mariadb service [puppet] - 10https://gerrit.wikimedia.org/r/714358 (https://phabricator.wikimedia.org/T289488) [12:48:00] (03PS1) 10PipelineBot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/714564 [12:48:08] (03CR) 10Zabe: osm: migrate cron osm_sync_lag to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713087 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [12:48:47] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/714358 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon) [12:49:38] (03PS9) 10Vgutierrez: envoyproxy: Support ciphersuite configuration [puppet] - 10https://gerrit.wikimedia.org/r/710577 (https://phabricator.wikimedia.org/T271421) [12:50:47] 
(03CR) 10MVernon: prometheus: couple mysqld exporter service to mariadb service (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/714358 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon) [13:02:33] !log push pfw policies - T289353 [13:03:46] no more bot? [13:04:00] no bot for restricted tasks [13:04:05] oh wait [13:04:18] that would be why it’s not posting the task title, but I assume it should still log to SAL… [13:05:27] testing T123456 [13:06:45] i guess it's down? [13:06:53] looks like it, it’s not responding over in -cloud either [13:08:10] from https://toolsadmin.wikimedia.org/tools/id/stashbot, it sounds hashar might be around to restart it (https://wikitech.wikimedia.org/wiki/Tool:Stashbot#Maintenance)? [13:08:19] was about to ping him and the other maintainers [13:09:15] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:37] T123456: Special:CentralAuth reports account attachment, which - being standalone - is confusing, report accout creation as well - https://phabricator.wikimedia.org/T123456 [13:12:48] it’s back \o/ [13:17:00] urbanecm: has.har is on vacation afaik [13:17:09] can I help? [13:19:14] majavah: looks like it resolved itself already (stashbot was stuck but now it’s back) [13:26:21] but thanks :) [13:28:48] 10SRE, 10CommRel-Specialists-Support (Jul-Sep-2021), 10Datacenter-Switchover: CommRel support for September 2021 Switchover - https://phabricator.wikimedia.org/T287546 (10Trizek-WMF) [13:29:20] 10SRE, 10Gerrit, 10Phabricator, 10Traffic: Phabricator and Gerrit: Improve the way that maintenance downtime is communicated to users. - https://phabricator.wikimedia.org/T180655 (10Aklapper) 05Open→03Resolved Assuming this is resolved per Dzahn's last comment. If not, then please reopen. 
[13:30:30] 10SRE, 10CommRel-Specialists-Support (Jul-Sep-2021), 10Datacenter-Switchover: CommRel support for September 2021 Switchover - https://phabricator.wikimedia.org/T287546 (10Trizek-WMF) >>! In T287546#7304462, @Elitre wrote: > @Trizek-WMF @Quiddity could you please verify whether the "two weeks before" tasks ha... [13:32:36] (03CR) 10Btullis: [C: 03+2] Improve creation of pkcs12 file by checking contents [puppet] - 10https://gerrit.wikimedia.org/r/709478 (https://phabricator.wikimedia.org/T287869) (owner: 10Btullis) [13:36:27] (03PS19) 10Jbond: wmflib::cache::nodes: test new puppetdb function [puppet] - 10https://gerrit.wikimedia.org/r/692286 [13:36:29] (03PS21) 10Jbond: sretest: test puppetdb function [puppet] - 10https://gerrit.wikimedia.org/r/692287 [13:38:01] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30805/console" [puppet] - 10https://gerrit.wikimedia.org/r/692287 (owner: 10Jbond) [13:38:44] (03CR) 10jerkins-bot: [V: 04-1] sretest: test puppetdb function [puppet] - 10https://gerrit.wikimedia.org/r/692287 (owner: 10Jbond) [13:39:24] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/714358 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon) [13:42:05] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01437 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:42:36] looking ( btullis gussing this is R:sslcert::x509_to_pkcs12' [13:42:59] Yes. 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/709478 [13:43:40] ack im looking at an-worker1097.eqiad.wmnet [13:46:31] (03CR) 10JMeybohm: [C: 04-1] utils/run_ci_locally: add ability to run the ci image interactively (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/714566 (owner: 10Jbond) [13:46:48] (03PS1) 10Jbond: sslcert::x509_to_pkcs12: fix heredoc [puppet] - 10https://gerrit.wikimedia.org/r/714568 [13:48:20] (03CR) 10Jbond: [C: 03+2] sslcert::x509_to_pkcs12: fix heredoc [puppet] - 10https://gerrit.wikimedia.org/r/714568 (owner: 10Jbond) [13:49:27] 10SRE, 10CommRel-Specialists-Support (Jul-Sep-2021), 10Datacenter-Switchover: CommRel support for September 2021 Switchover - https://phabricator.wikimedia.org/T287546 (10Trizek-WMF) [13:50:53] btullis: see 714568 for a fix, however there is another erro in that the unless line is failing so the p12 is getting re-genrated on each run. im looking [13:51:18] jbond: Thanks ever so much. [13:53:56] (03PS1) 10Jbond: sslcert::x509_to_pkcs12: fix unless statment [puppet] - 10https://gerrit.wikimedia.org/r/714570 [13:55:29] (03CR) 10Jbond: [C: 03+2] sslcert::x509_to_pkcs12: fix unless statment [puppet] - 10https://gerrit.wikimedia.org/r/714570 (owner: 10Jbond) [13:57:10] btullis: ^^^ thats seems to have fixed the other issue [13:57:19] just going to run puppet on the failed nondes [13:57:20] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mc2031.codfw.wmnet'... [13:57:57] Brilliant, thanks. So sorry for the trouble. I will be more vigilant with PCC next time. 
[14:00:05] btullis: in this instance pcc probably wouldn;t have helped that much, the issues where with exec resource which are had to capture in pcc as it dosn;t actully ru the commands [14:00:17] s/had/hard/ [14:01:08] Yes, I see. [14:01:31] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto) The helm binary in `helmfile` can be set using the `--helm-binary` option or by setting `helmBinary` in the `helmfile.yaml`. It can be set globally (like in [admin-ng](https://ger... [14:02:47] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.002875 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:07:39] (03PS2) 10Jbond: utils/run_ci_locally: add ability to run the ci image interactively [puppet] - 10https://gerrit.wikimedia.org/r/714566 [14:08:08] (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/714566 (owner: 10Jbond) [14:08:10] (03CR) 10jerkins-bot: [V: 04-1] utils/run_ci_locally: add ability to run the ci image interactively [puppet] - 10https://gerrit.wikimedia.org/r/714566 (owner: 10Jbond) [14:08:26] (03PS1) 10JMeybohm: wmflib: Add disk_type fact [puppet] - 10https://gerrit.wikimedia.org/r/714572 (https://phabricator.wikimedia.org/T288509) [14:09:26] (03PS3) 10Jbond: utils/run_ci_locally: add ability to run the ci image interactively [puppet] - 10https://gerrit.wikimedia.org/r/714566 [14:10:27] (03CR) 10jerkins-bot: [V: 04-1] wmflib: Add disk_type fact [puppet] - 10https://gerrit.wikimedia.org/r/714572 (https://phabricator.wikimedia.org/T288509) (owner: 10JMeybohm) [14:12:55] (03CR) 10Jelto: [C: 03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/714071 (https://phabricator.wikimedia.org/T289131) (owner: 10JMeybohm) [14:14:29] (03PS2) 10JMeybohm: wmflib: Add disk_type fact [puppet] - 10https://gerrit.wikimedia.org/r/714572 (https://phabricator.wikimedia.org/T288509) [14:16:30] (03CR) 10jerkins-bot: [V: 04-1] wmflib: Add disk_type fact [puppet] - 10https://gerrit.wikimedia.org/r/714572 (https://phabricator.wikimedia.org/T288509) (owner: 10JMeybohm) [14:19:27] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2031.codfw.wmnet with reason: REIMAGE [14:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:22] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 156 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [14:23:04] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2031.codfw.wmnet with reason: REIMAGE [14:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:11] jbond: looks like PS19 of https://gerrit.wikimedia.org/r/c/operations/puppet/+/692286 is an unintended push/review ? [14:23:30] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 74 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [14:24:04] (03CR) 10Btullis: Install Alluxio to the test cluster (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [14:25:22] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10JMeybohm) >>! In T251305#7304745, @Jelto wrote: > The helm binary in `helmfile` can be set using the `--helm-binary` option or by setting `helmBinary` in the `helmfile.yaml`. > [...] Tha... 
[14:27:52] (03PS34) 10Btullis: Install Alluxio to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) [14:34:23] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2031.codfw.wmnet'] ` and were **ALL** successful. [14:35:53] (03CR) 10Michael DiPietro: "https://puppet-compiler.wmflabs.org/compiler1003/30809/" [puppet] - 10https://gerrit.wikimedia.org/r/714409 (owner: 10Michael DiPietro) [14:42:07] 10SRE, 10SRE-Access-Requests, 10Gerrit, 10LDAP-Access-Requests: Add dancy to `gerritadmin` LDAP group - https://phabricator.wikimedia.org/T289537 (10dancy) Thanks @jcrespo and apologies for not using the right format for the task description. [14:44:00] 10SRE, 10SRE-Access-Requests, 10Gerrit, 10LDAP-Access-Requests: Add dancy to `gerritadmin` LDAP group - https://phabricator.wikimedia.org/T289537 (10jcrespo) No issue, uniformization helps if we have to refer the task in the future and faster processing if the request is unclear, but it was not the case of... 
[14:44:36] 10Puppet, 10Infrastructure-Foundations, 10MW-on-K8s, 10Kubernetes, 10Patch-For-Review: Add a fact holding the type of a disk (spinning/ssd) - https://phabricator.wikimedia.org/T288509 (10JMeybohm) a:03JMeybohm [14:45:40] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Deploy a ResourceQuota to allow priority pods in kube-system [deployment-charts] - 10https://gerrit.wikimedia.org/r/714038 (https://phabricator.wikimedia.org/T289131) (owner: 10JMeybohm) [14:47:28] godog: thanks will fix shortly [14:48:07] (03Merged) 10jenkins-bot: admin_ng: Deploy a ResourceQuota to allow priority pods in kube-system [deployment-charts] - 10https://gerrit.wikimedia.org/r/714038 (https://phabricator.wikimedia.org/T289131) (owner: 10JMeybohm) [14:49:39] sure np [14:49:40] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:36] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [14:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:06] (03CR) 10Ahmon Dancy: [C: 03+2] "The prior gate pipeline failed:" [core] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714446 (owner: 10TrainBranchBot) [14:54:43] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [14:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:07] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[14:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:48] (03CR) 10Ahmon Dancy: [C: 03+1] profile::releases:common: Install emacs on releases servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714414 (owner: 10Ahmon Dancy) [15:02:25] (03PS35) 10Btullis: Install Alluxio to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) [15:06:04] (03CR) 10Elukey: "Ben I didn't check the whole change, but I am a little puzzled by the need of allowing ssh between master and workers. Is there a way to a" [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [15:08:07] (03CR) 10JMeybohm: [C: 03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/714566 (owner: 10Jbond) [15:08:47] (03CR) 10JMeybohm: [C: 03+1] envoyproxy: Support ciphersuite configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/710577 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [15:10:55] (03Merged) 10jenkins-bot: Branch commit for wmf/1.37.0-wmf.20 [core] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714446 (owner: 10TrainBranchBot) [15:13:04] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10lmata) [15:13:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:13:45] (03CR) 10Vgutierrez: [C: 03+2] envoyproxy: Support ciphersuite configuration [puppet] - 10https://gerrit.wikimedia.org/r/710577 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [15:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[15:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:28] (03CR) 10Btullis: Install Alluxio to the test cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [15:18:12] (03CR) 10Bstorm: [C: 03+1] "PCC looks just like expected (even if it's nearly useless here because it doesn't even read execs much)." [puppet] - 10https://gerrit.wikimedia.org/r/714409 (owner: 10Michael DiPietro) [15:18:57] (03CR) 10Jbond: [C: 03+2] utils/run_ci_locally: add ability to run the ci image interactively [puppet] - 10https://gerrit.wikimedia.org/r/714566 (owner: 10Jbond) [15:21:52] 10SRE, 10Traffic, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Use Grizzly for Varnish SLO Grafana dashboard - https://phabricator.wikimedia.org/T289036 (10lmata) [15:22:14] 10SRE, 10Traffic, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Use Grizzly for Varnish SLO Grafana dashboard - https://phabricator.wikimedia.org/T289036 (10lmata) a:03herron [15:23:50] 10SRE, 10Icinga, 10Observability-Alerting, 10observability: Extend dpkg Icinga check to also check for inconsistent apt state - https://phabricator.wikimedia.org/T190693 (10lmata) [15:23:53] !log dcausse@deploy1002 Started deploy [wikimedia/discovery/analytics@e02c602]: transfer_to_es: stop adding data to article_topics [15:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:07] (03PS20) 10Jbond: wmflib::cache::nodes: test new puppetdb function [puppet] - 10https://gerrit.wikimedia.org/r/692286 [15:24:31] (03PS22) 10Jbond: sretest: test puppetdb function [puppet] - 10https://gerrit.wikimedia.org/r/692287 [15:25:29] (03PS1) 10Jcrespo: mediabackup: Start backup of commonswiki on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/714587 [15:26:11] !log dcausse@deploy1002 Finished deploy [wikimedia/discovery/analytics@e02c602]: transfer_to_es: stop adding data to article_topics 
(duration: 02m 17s) [15:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:26] (03CR) 10jerkins-bot: [V: 04-1] sretest: test puppetdb function [puppet] - 10https://gerrit.wikimedia.org/r/692287 (owner: 10Jbond) [15:27:11] (03PS7) 10Vgutierrez: envoyproxy: Support ECDH curves configuration [puppet] - 10https://gerrit.wikimedia.org/r/710581 (https://phabricator.wikimedia.org/T271421) [15:35:28] (03PS23) 10Jbond: sretest: test puppetdb function [puppet] - 10https://gerrit.wikimedia.org/r/692287 [15:35:35] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={rails,sidekiq} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:36:17] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30813/console" [puppet] - 10https://gerrit.wikimedia.org/r/692287 (owner: 10Jbond) [15:37:12] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30814/console" [puppet] - 10https://gerrit.wikimedia.org/r/710581 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [15:37:31] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:37:43] (03CR) 10Jbond: [C: 03+2] wmflib::cache::nodes: test new puppetdb function [puppet] - 10https://gerrit.wikimedia.org/r/692286 (owner: 10Jbond) [15:37:48] (03CR) 10Jbond: [V: 03+1 C: 03+2] sretest: test puppetdb function [puppet] - 10https://gerrit.wikimedia.org/r/692287 (owner: 10Jbond) [15:39:33] (03PS1) 10Jcrespo: mediabackup: Fix wrong version of s3 readandlist policy [puppet] - 10https://gerrit.wikimedia.org/r/714588 (https://phabricator.wikimedia.org/T276442) [15:39:48] (03PS2) 10Jcrespo: mediabackup: Fix wrong version of s3 readandlist policy [puppet] - 10https://gerrit.wikimedia.org/r/714588 (https://phabricator.wikimedia.org/T276442) [15:41:47] (03PS1) 10Jbond: wmflib::role_hosts: return a sorted array [puppet] - 10https://gerrit.wikimedia.org/r/714589 [15:42:42] (03CR) 10Filippo Giunchedi: [C: 03+1] "Can't comment on the exact semantics but from commit message LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/714587 (owner: 10Jcrespo) [15:43:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:44:54] (03PS1) 10Bstorm: cloud-vps: don't do nag alerts for puppet on some projects [puppet] - 10https://gerrit.wikimedia.org/r/714590 [15:45:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:45:27] (03CR) 10Jbond: [C: 03+2] wmflib::role_hosts: return a sorted array [puppet] - 10https://gerrit.wikimedia.org/r/714589 (owner: 10Jbond) [15:46:58] (03CR) 10Btullis: Install Alluxio to the test cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [15:47:41] (03CR) 10Jcrespo: [C: 03+2] mediabackup: Start backup of commonswiki on eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714587 (owner: 10Jcrespo) [15:47:55] (03CR) 10Jcrespo: [C: 03+2] mediabackup: Fix wrong version of s3 readandlist policy [puppet] - 10https://gerrit.wikimedia.org/r/714588 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [15:48:29] (03CR) 10Majavah: cloud-vps: don't do nag alerts for puppet on some projects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714590 (owner: 10Bstorm) [15:49:00] (03PS1) 10Jbond: use to_yaml [puppet] - 10https://gerrit.wikimedia.org/r/714592 [15:49:16] (03CR) 10Jbond: [V: 03+2 C: 03+2] use to_yaml [puppet] - 10https://gerrit.wikimedia.org/r/714592 (owner: 10Jbond) [15:49:31] (03CR) 10Bstorm: cloud-vps: don't do nag alerts for puppet on some projects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714590 (owner: 10Bstorm) [15:49:38] (03CR) 10Michael DiPietro: [C: 03+2] remove lvm [puppet] - 10https://gerrit.wikimedia.org/r/714409 (owner: 10Michael DiPietro) [15:51:19] (03PS2) 10Bstorm: cloud-vps: don't do nag alerts for puppet on some projects [puppet] - 10https://gerrit.wikimedia.org/r/714590 [15:52:05] (03CR) 10Bstorm: cloud-vps: don't do nag alerts for puppet on some projects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714590 (owner: 10Bstorm) [15:54:13] (03PS1) 10Herron: cleanup kafkamon role description [puppet] - 10https://gerrit.wikimedia.org/r/714593 
(https://phabricator.wikimedia.org/T252773) [15:55:56] (03PS2) 10Herron: cleanup kafkamon role description [puppet] - 10https://gerrit.wikimedia.org/r/714593 (https://phabricator.wikimedia.org/T252773) [15:58:33] (03CR) 10Jbond: [C: 03+1] Disable the "long running screen/tmux session" check by default [puppet] - 10https://gerrit.wikimedia.org/r/712123 (https://phabricator.wikimedia.org/T288028) (owner: 10Muehlenhoff) [16:00:04] jbond and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210824T1600). [16:03:55] (03CR) 10Herron: [C: 03+2] cleanup kafkamon role description [puppet] - 10https://gerrit.wikimedia.org/r/714593 (https://phabricator.wikimedia.org/T252773) (owner: 10Herron) [16:10:59] (03CR) 10David Caro: cloud-vps: don't do nag alerts for puppet on some projects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714590 (owner: 10Bstorm) [16:12:12] (03CR) 10Bstorm: cloud-vps: don't do nag alerts for puppet on some projects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714590 (owner: 10Bstorm) [16:14:35] (03PS3) 10Bstorm: cloud-vps: don't do nag alerts for puppet on some projects [puppet] - 10https://gerrit.wikimedia.org/r/714590 [16:14:47] (03CR) 10Bstorm: cloud-vps: don't do nag alerts for puppet on some projects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714590 (owner: 10Bstorm) [16:35:33] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 271 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:39:23] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:40:22] `i/R/ResponseFactory:42 JSON encoding error: Malformed UTF-8 characters, possibly incorrectly encoded` [16:40:42] (03CR) 10Andrew Bogott: [C: 03+1] "This seems fine. Someday we might want to be able to more actively configure the opt-out but for the foreseeable future this is just fine." [puppet] - 10https://gerrit.wikimedia.org/r/714590 (owner: 10Bstorm) [16:42:03] (03CR) 10Andrew Bogott: cloud-vps: don't do nag alerts for puppet on some projects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714590 (owner: 10Bstorm) [16:43:50] (03PS4) 10Bstorm: cloud-vps: don't do nag alerts for puppet on some projects [puppet] - 10https://gerrit.wikimedia.org/r/714590 [16:44:21] (03CR) 10Bstorm: cloud-vps: don't do nag alerts for puppet on some projects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714590 (owner: 10Bstorm) [16:46:44] (03CR) 10Andrew Bogott: [C: 03+1] cloud-vps: don't do nag alerts for puppet on some projects [puppet] - 10https://gerrit.wikimedia.org/r/714590 (owner: 10Bstorm) [16:49:57] (03CR) 10Bstorm: [C: 03+2] cloud-vps: don't do nag alerts for puppet on some projects [puppet] - 10https://gerrit.wikimedia.org/r/714590 (owner: 10Bstorm) [16:50:21] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 105 probes of 616 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:53:01] (03PS1) 10MSantos: mobileapps: bump to 2021-08-24-144003-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/714601 [16:56:15] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 44 probes of 616 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:59:46] (03CR) 10Krinkle: [C: 03+1] profiler: use seperate pipeline inside k8s pods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711580 (https://phabricator.wikimedia.org/T288165) (owner: 10Dave Pifke) [17:00:05] chrisalbon and accraze: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210824T1700). [17:00:19] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 293 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:00:49] (03PS1) 10Legoktm: shellbox: Avoid repeating image name in helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/714606 [17:04:11] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:09:18] (03CR) 10MSantos: [C: 03+2] mobileapps: bump to 2021-08-24-144003-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/714601 (owner: 10MSantos) [17:12:16] (03Merged) 10jenkins-bot: mobileapps: bump to 2021-08-24-144003-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/714601 (owner: 10MSantos) [17:13:29] (03CR) 10Thcipriani: [V: 03+2 C: 03+2] Review access change [software/mailman-templates] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/709976 (https://phabricator.wikimedia.org/T288027) (owner: 10Hashar) [17:17:10] !log mbsantos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . 
[17:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:27] PROBLEM - Maps - OSM synchronization lag - eqiad on alert1001 is CRITICAL: 7.535e+05 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1 [17:19:28] eqiad maps OSM DB is under maintenance and depooled, disabling the alert ^ [17:19:49] !log mbsantos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [17:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:07] !log mbsantos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [17:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:18] dancy: that's somewhat worrying. at least at that frequency seems problematic by itself even if no impact (but there likely is impact) [17:29:50] "/w/rest.php/v1/file/File:**.svg" [17:29:54] That seems like a new API [17:30:05] I've not seen rest/file before [17:30:25] ^ filed at T289597 [17:30:26] T289597: MediaWiki\Rest: JSON encoding error: Malformed UTF-8 characters, possibly incorrectly encoded - https://phabricator.wikimedia.org/T289597 [17:30:35] Thanks brennen [17:36:25] (03CR) 1020after4: [C: 03+2] selenium: Update README.md file [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/713217 (https://phabricator.wikimedia.org/T282237) (owner: 10Sahilgrewalhere) [17:54:58] 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10JMando) [18:00:05] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210824T1800) [18:30:29] (03CR) 10Gehel: [C: 03+1] "LGTM, feel free to merge" [cookbooks] - 10https://gerrit.wikimedia.org/r/713931 (https://phabricator.wikimedia.org/T280221) (owner: 10Gehel) [18:34:23] 
RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:35:11] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:47:50] (03CR) 10Gehel: [C: 04-1] blazegraph: Setup new wcqs instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713946 (owner: 10Ebernhardson) [19:00:05] dancy and brennen: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210824T1900). [19:00:30] I will start doing stuff in 30 minutes. [19:01:50] (03CR) 10Ebernhardson: blazegraph: Setup new wcqs instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713946 (owner: 10Ebernhardson) [19:08:56] (03PS5) 10Ebernhardson: blazegraph: Setup new wcqs instances [puppet] - 10https://gerrit.wikimedia.org/r/713946 [19:17:14] (03PS2) 10Legoktm: shellbox: Avoid repeating image name in helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/714606 [19:34:40] (03PS5) 10Ssingh: wikidough check: example authdns part [dns] - 10https://gerrit.wikimedia.org/r/711577 (owner: 10BBlack) [19:35:10] (03CR) 10Ssingh: "(CI will fail till https://gerrit.wikimedia.org/r/c/integration/config/+/712991 is merged)" [dns] - 10https://gerrit.wikimedia.org/r/711577 (owner: 10BBlack) [19:35:38] (03CR) 10jerkins-bot: [V: 04-1] wikidough check: example authdns part [dns] - 10https://gerrit.wikimedia.org/r/711577 (owner: 10BBlack) [19:40:09] ok. 
.back on train duty [19:40:28] (03PS1) 10Ahmon Dancy: testwikis wikis to 1.37.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714619 [19:40:30] (03CR) 10Ahmon Dancy: [C: 03+2] testwikis wikis to 1.37.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714619 (owner: 10Ahmon Dancy) [19:41:30] (03Merged) 10jenkins-bot: testwikis wikis to 1.37.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714619 (owner: 10Ahmon Dancy) [19:41:31] !log dancy@deploy1002 Started scap: testwikis wikis to 1.37.0-wmf.20 [19:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:22] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10jijiki) [20:02:56] (03PS1) 10Ebernhardson: query_service: support multiple variants of wdqs microsite [puppet] - 10https://gerrit.wikimedia.org/r/714624 (https://phabricator.wikimedia.org/T280247) [20:04:27] (03CR) 10jerkins-bot: [V: 04-1] query_service: support multiple variants of wdqs microsite [puppet] - 10https://gerrit.wikimedia.org/r/714624 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [20:09:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:20] (03PS2) 10Ebernhardson: query_service: support multiple variants of wdqs microsite [puppet] - 10https://gerrit.wikimedia.org/r/714624 (https://phabricator.wikimedia.org/T280247) [20:16:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[20:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:03] !log dancy@deploy1002 Finished scap: testwikis wikis to 1.37.0-wmf.20 (duration: 36m 32s) [20:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:19] (03PS3) 10Ebernhardson: query_service: support multiple variants of wdqs microsite [puppet] - 10https://gerrit.wikimedia.org/r/714624 (https://phabricator.wikimedia.org/T280247) [20:21:53] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [20:24:36] (03CR) 10Ssingh: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/711577 (owner: 10BBlack) [20:24:45] (03PS1) 10Ahmon Dancy: group0 wikis to 1.37.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714629 [20:24:47] (03CR) 10Ahmon Dancy: [C: 03+2] group0 wikis to 1.37.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714629 (owner: 10Ahmon Dancy) [20:25:29] (03Merged) 10jenkins-bot: group0 wikis to 1.37.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714629 (owner: 10Ahmon Dancy) [20:25:35] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [20:27:01] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.37.0-wmf.20 [20:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:31] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install centrallog2002.codfw.wmnet - https://phabricator.wikimedia.org/T289624 (10RobH) [20:30:55] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install centrallog2002.codfw.wmnet - https://phabricator.wikimedia.org/T289624 (10RobH) [20:31:43] 10ops-codfw, 10DC-Ops, 10observability, 10SRE Observability (FY2021/2022-Q1): (Need By: TBD) 
rack/setup/install centrallog2002.codfw.wmnet - https://phabricator.wikimedia.org/T289624 (10RobH) [20:32:06] 10ops-codfw, 10DC-Ops, 10observability, 10SRE Observability (FY2021/2022-Q1): (Need By: TBD) rack/setup/install centrallog2002.codfw.wmnet - https://phabricator.wikimedia.org/T289624 (10RobH) a:03Papaul [20:33:19] !log dancy@deploy1002 Pruned MediaWiki: 1.37.0-wmf.18 (duration: 03m 26s) [20:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:33] !log dancy@deploy1002 Pruned MediaWiki: 1.37.0-wmf.17 (duration: 01m 48s) [20:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:27] PROBLEM - Host wdqs1013 is DOWN: PING CRITICAL - Packet loss = 100% [20:45:37] RECOVERY - Host wdqs1013 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [20:46:55] 10SRE, 10Continuous-Integration-Infrastructure, 10MW-1.37-notes (1.37.0-wmf.15; 2021-07-19), 10Patch-For-Review, 10Release-Engineering-Team (Done by Fri 03 Sep): Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (10dancy) Apologies for the delay... [20:48:27] dancy: I'm sorry, I forgot that on Tuesday, there's only one EU-friendly B&C. Could I please sneak in a quick patch? [20:48:35] Sure [20:48:39] I'm done [20:48:42] okay, thanks! 
[20:48:56] (03CR) 10Urbanecm: [C: 03+2] Growth features: Promote 9 wikis out of dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714366 (https://phabricator.wikimedia.org/T287871) (owner: 10Urbanecm) [20:49:01] (03PS3) 10Urbanecm: Growth features: Promote 9 wikis out of dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714366 (https://phabricator.wikimedia.org/T287871) [20:49:04] (03CR) 10Urbanecm: [C: 03+2] Growth features: Promote 9 wikis out of dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714366 (https://phabricator.wikimedia.org/T287871) (owner: 10Urbanecm) [20:49:53] (03Merged) 10jenkins-bot: Growth features: Promote 9 wikis out of dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714366 (https://phabricator.wikimedia.org/T287871) (owner: 10Urbanecm) [20:54:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:00] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: a6fd96b15e6e3c068c2faac60208b9722d32af0f: Growth features: Promote 9 wikis out of dark mode (T287871; T287874; T287872; T287880; T287868; T287873; T287879; T287875; T287876) (duration: 01m 25s) [20:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:14] T287876: Deploy Growth features on Slovenian Wikipedia - https://phabricator.wikimedia.org/T287876 [20:55:14] T287871: Deploy Growth features on Azerbaijani Wikipedia - https://phabricator.wikimedia.org/T287871 [20:55:15] T287868: Deploy Growth features on Kurdish Wikipedia - https://phabricator.wikimedia.org/T287868 [20:55:15] T287880: Deploy Growth features on Georgian Wikipedia - https://phabricator.wikimedia.org/T287880 [20:55:15] T287879: Deploy Growth features on Malayalam Wikipedia - https://phabricator.wikimedia.org/T287879 [20:55:15] T287874: Deploy Growth features 
on Estonian Wikipedia - https://phabricator.wikimedia.org/T287874 [20:55:16] T287873: Deploy Growth features on Lithuanian Wikipedia - https://phabricator.wikimedia.org/T287873 [20:55:16] T287872: Deploy Growth features on Finnish Wikipedia - https://phabricator.wikimedia.org/T287872 [20:55:16] T287875: Deploy Growth features on Marathi Wikipedia - https://phabricator.wikimedia.org/T287875 [20:55:20] * urbanecm done [20:55:22] thanks dancy [20:55:26] np [20:59:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:04] tgr: How many deployers does it take to do Long-running script deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210824T2100). [21:10:22] (03PS1) 10Arlolra: Disable legacy media dom on a few more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714635 [21:10:23] !log running extensions/GrowthExperiments/maintenance/revalidateLinkRecommendations.php on various wikis per T282873#7303828 [21:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:27] T282873: Add Link: Fix production discrepancies between the link recommendation table and the search index - https://phabricator.wikimedia.org/T282873 [21:16:20] (03PS1) 10RobH: updating for r740gpu config [software] - 10https://gerrit.wikimedia.org/r/714638 [21:17:09] (03CR) 10RobH: [C: 03+2] updating for r740gpu config [software] - 10https://gerrit.wikimedia.org/r/714638 (owner: 10RobH) [21:17:25] tgr: this script will drop suggestions from the old add a link model, is that right? Or is it for a diff issue, [21:18:05] (03Merged) 10jenkins-bot: updating for r740gpu config [software] - 10https://gerrit.wikimedia.org/r/714638 (owner: 10RobH) [21:18:11] (03CR) 10Subramanya Sastry: "It might be useful to pin this to a phab task in case someone has feedback after deploy." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/714635 (owner: 10Arlolra) [21:23:35] (03PS4) 10Ebernhardson: query_service: support multiple variants of wdqs microsite [puppet] - 10https://gerrit.wikimedia.org/r/714624 (https://phabricator.wikimedia.org/T280247) [21:27:33] (03CR) 10Ssingh: wikidough check: example authdns part (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/711577 (owner: 10BBlack) [21:32:46] (03PS3) 10Legoktm: [WIP] shellbox: Avoid repeating image name in helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/714606 [21:36:17] (03PS4) 10Legoktm: [WIP] shellbox: Avoid repeating image name in helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/714606 [21:40:36] (03CR) 10RLazarus: [C: 03+2] hieradata: Run httpbb hourly from cumin2001 against a codfw appserver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714137 (https://phabricator.wikimedia.org/T289202) (owner: 10RLazarus) [21:41:56] (03PS1) 10Michael DiPietro: update quarry systemd and branch [puppet] - 10https://gerrit.wikimedia.org/r/714640 [21:42:38] (03PS5) 10Legoktm: [WIP] shellbox: Avoid repeating image name in helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/714606 [21:42:40] (03PS1) 10Legoktm: shellbox: Actually use value of main_app.version [deployment-charts] - 10https://gerrit.wikimedia.org/r/714641 [21:46:19] (03CR) 10Legoktm: [C: 03+2] shellbox: Actually use value of main_app.version [deployment-charts] - 10https://gerrit.wikimedia.org/r/714641 (owner: 10Legoktm) [21:47:25] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:48:42] (03Merged) 10jenkins-bot: shellbox: Actually use value of main_app.version [deployment-charts] - 10https://gerrit.wikimedia.org/r/714641 (owner: 10Legoktm) [21:50:47] rzl: ^ I assume you're on that httpbb failure [21:52:16] urbanecm: it 
replaces the ones where the model ID is not current, yeah. [21:52:41] (03CR) 10Legoktm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/714606 (owner: 10Legoktm) [21:52:42] legoktm: yep that's me from 714137, thanks [21:52:43] (03PS2) 10Ahmon Dancy: profile::releases:common: Install emacs on releases servers [puppet] - 10https://gerrit.wikimedia.org/r/714414 [21:53:13] (03CR) 10jerkins-bot: [V: 04-1] profile::releases:common: Install emacs on releases servers [puppet] - 10https://gerrit.wikimedia.org/r/714414 (owner: 10Ahmon Dancy) [21:54:19] No such file or directory: '/srv/deployment/httpbb-tests/appserver/*.yaml' [21:54:21] ahahaha fair enough! [21:54:33] one /bin/sh -c coming up [21:54:42] hehe [21:55:36] (03CR) 10Legoktm: [C: 03+2] [WIP] shellbox: Avoid repeating image name in helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/714606 (owner: 10Legoktm) [21:55:49] (03PS6) 10Legoktm: shellbox: Avoid repeating image name in helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/714606 [21:55:55] (03CR) 10Legoktm: [C: 03+2] shellbox: Avoid repeating image name in helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/714606 (owner: 10Legoktm) [21:56:57] (03Abandoned) 10Legoktm: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/714130 (owner: 10PipelineBot) [21:57:00] (03Abandoned) 10Legoktm: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/714129 (owner: 10PipelineBot) [21:57:13] (03PS3) 10Ahmon Dancy: profile::releases:common: Install emacs on releases servers [puppet] - 10https://gerrit.wikimedia.org/r/714414 [21:58:27] (03Merged) 10jenkins-bot: shellbox: Avoid repeating image name in helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/714606 (owner: 10Legoktm) [22:01:09] (03CR) 10Ahmon Dancy: profile::releases:common: Install emacs on releases servers (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/714414 (owner: 10Ahmon Dancy) [22:04:15] !log legoktm@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox-constraints' for release 'main' . [22:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:30] !log legoktm@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox' for release 'main' . [22:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:09] (03PS1) 10RLazarus: httpbb: Wrap the systemd ExecCommand in "sh -c" so the wildcard works. [puppet] - 10https://gerrit.wikimedia.org/r/714642 (https://phabricator.wikimedia.org/T289202) [22:06:27] legoktm: ^ mailed you the fix, if you have a sec to glance at it [22:07:51] (03CR) 10Legoktm: [C: 03+1] "Looks correct" [puppet] - 10https://gerrit.wikimedia.org/r/714642 (https://phabricator.wikimedia.org/T289202) (owner: 10RLazarus) [22:08:00] thanks! [22:08:14] (03CR) 10RLazarus: [C: 03+2] httpbb: Wrap the systemd ExecCommand in "sh -c" so the wildcard works. 
[puppet] - 10https://gerrit.wikimedia.org/r/714642 (https://phabricator.wikimedia.org/T289202) (owner: 10RLazarus) [22:10:40] (03CR) 10Ebernhardson: "pcc seems reasonable: https://puppet-compiler.wmflabs.org/compiler1002/30820/" [puppet] - 10https://gerrit.wikimedia.org/r/714624 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [22:11:06] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:11:18] 😎 [22:11:32] (03PS4) 10Ahmon Dancy: profile::releases:common: Install emacs on releases servers [puppet] - 10https://gerrit.wikimedia.org/r/714414 [22:12:23] (03CR) 10jerkins-bot: [V: 04-1] profile::releases:common: Install emacs on releases servers [puppet] - 10https://gerrit.wikimedia.org/r/714414 (owner: 10Ahmon Dancy) [22:16:48] (03PS5) 10Ahmon Dancy: profile::releases:common: Install emacs on releases servers [puppet] - 10https://gerrit.wikimedia.org/r/714414 [22:29:35] (03PS2) 10Arlolra: Disable legacy media dom on a few more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714635 (https://phabricator.wikimedia.org/T51097) [22:31:08] 10ops-eqiad, 10Analytics, 10DC-Ops: (Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10RobH) [22:31:44] (03PS1) 10RLazarus: hieradata: Run httpbb hourly from cumin1001 against an eqiad appserver. 
[puppet] - 10https://gerrit.wikimedia.org/r/714646 (https://phabricator.wikimedia.org/T289202) [22:32:00] 10ops-eqiad, 10Analytics, 10DC-Ops: (Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10RobH) [22:32:07] 10ops-eqiad, 10Analytics, 10DC-Ops: (Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10RobH) [22:32:36] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30825/console" [puppet] - 10https://gerrit.wikimedia.org/r/714646 (https://phabricator.wikimedia.org/T289202) (owner: 10RLazarus) [22:33:45] 10ops-eqiad, 10Analytics, 10DC-Ops: (Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10RobH) [22:34:00] 10ops-eqiad, 10Analytics, 10DC-Ops: (Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10RobH) a:03Jclark-ctr [22:44:11] (03CR) 10Subramanya Sastry: [C: 03+1] Disable legacy media dom on a few more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714635 (https://phabricator.wikimedia.org/T51097) (owner: 10Arlolra) [23:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening backport windowYour patch may or may not be deployed at the sole discretion of the deployer deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210824T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:05:41] jouncebot: s/windowYour/window. 
Your/ :P [23:06:47] * urbanecm needs someone to tell him to not be lazy and fix jerkins's complaint in https://gerrit.wikimedia.org/r/c/wikimedia/bots/jouncebot/+/708756 Platonides [23:09:21] oh, it picks that part from the calendar [23:09:38] yeah [23:09:55] looking at the patch, no need to use a regex there [23:09:59] it's just a string replacement [23:10:13] right [23:10:20] so just .replace("...", "") [23:10:36] yes [23:11:33] jenkins is being picky about line lengths :/ [23:11:55] yeah :/ [23:13:03] * Platonides mumbles that it may take longer for him to clone than to edit it [23:15:34] * legoktm clones Platonides [23:16:04] (03PS2) 10Platonides: Hide warning notice from backport window announcements [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/708756 (owner: 10Urbanecm) [23:16:05] xD [23:16:23] oh, even better. someone else fixes that patch! [23:17:03] (03CR) 10jerkins-bot: [V: 04-1] Hide warning notice from backport window announcements [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/708756 (owner: 10Urbanecm) [23:18:05] developing cloning techniques would probably fall under the strategic plan somewhere >.> [23:18:24] xDD [23:19:08] it is set at 79 chars? come on! 
[23:19:13] * urbanecm fixes that patch [23:19:42] i guess this is okay https://www.irccloud.com/pastebin/puMwr9IZ/ [23:19:51] (03PS3) 10Urbanecm: Hide warning notice from backport window announcements [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/708756 [23:20:05] Platonides: https://gerrit.wikimedia.org/r/c/wikimedia/bots/jouncebot/+/708756/2..3/jouncebot/deploypage.py [23:20:18] (03CR) 10jerkins-bot: [V: 04-1] Hide warning notice from backport window announcements [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/708756 (owner: 10Urbanecm) [23:20:22] :( [23:20:44] it also complained at the definition [23:20:48] meh [23:21:17] my turn [23:21:37] (03PS4) 10Urbanecm: Hide warning notice from backport window announcements [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/708756 [23:21:49] Platonides: too late. Used # noqa E501 as my hammer [23:21:59] feel free to overwrite if you have something better in mind [23:22:11] (03CR) 10jerkins-bot: [V: 04-1] Hide warning notice from backport window announcements [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/708756 (owner: 10Urbanecm) [23:22:48] (03PS5) 10Platonides: Hide warning notice from backport window announcements [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/708756 (owner: 10Urbanecm) [23:23:08] race condition :P [23:23:17] (03CR) 10jerkins-bot: [V: 04-1] Hide warning notice from backport window announcements [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/708756 (owner: 10Urbanecm) [23:24:08] (03PS6) 10Urbanecm: Hide warning notice from backport window announcements [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/708756 [23:24:13] F821 undefined name 'BACKPORT_WARNING' ? 
[23:24:39] (03CR) 10jerkins-bot: [V: 04-1] Hide warning notice from backport window announcements [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/708756 (owner: 10Urbanecm) [23:25:04] Platonides: self.BACKPORT_WARNING [23:25:09] that would work [23:25:10] oh, right [23:25:20] fixed that [23:25:23] now, tell that to jerkins [23:25:23] but it still doesn't like me [23:25:47] and I had missed the '' that second time [23:25:53] yup [23:26:07] Platonides: will you reformat the def to please jenkins, or shall I? [23:26:12] ok, I do [23:26:13] it wants you to do this https://www.irccloud.com/pastebin/Ok9m1WAv/ [23:26:30] yes, I saw [23:26:41] cool :) [23:27:12] (03PS7) 10Platonides: Hide warning notice from backport window announcements [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/708756 (owner: 10Urbanecm) [23:27:38] it takes too long to please jenkins for such silly change [23:27:58] unfortunately [23:28:05] i kinda like the collaborative work though :D [23:29:06] makes it more fun [23:29:15] jenkins finally liked it :D [23:29:18] great! [23:29:23] now we need a human to like it too [23:30:57] I think it's ok, but I don't have +2 there [23:31:59] https://gerrit.wikimedia.org/r/admin/repos/wikimedia,access says anyone in ldap/wmf should [23:32:05] legoktm: maybe? 🙂 [23:32:42] sure [23:33:18] (03CR) 10Legoktm: [C: 03+2] Hide warning notice from backport window announcements [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/708756 (owner: 10Urbanecm) [23:33:23] thanks [23:33:30] I have no idea how to deploy it though, I think it's on Toolforge [23:33:38] jouncebot: help [23:33:38] **** JounceBot Help **** [23:33:38] JounceBot is a deployment helper bot for the Wikimedia movement. [23:33:38] Source at: https://gerrit.wikimedia.org/g/wikimedia/bots/jouncebot [23:33:38] Available commands: [23:33:38] HELP Print all commands known to the server. [23:33:39] NEXT Get the next deployment event(s if they happen at the same time). 
[23:33:39] NOW Get the current deployment event(s) or the time until the next. [23:33:40] REFRESH Refresh my knowledge about deployments. [23:33:51] (03Merged) 10jenkins-bot: Hide warning notice from backport window announcements [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/708756 (owner: 10Urbanecm) [23:33:52] legoktm: https://wikitech.wikimedia.org/wiki/Tool:Jouncebot has docs for that [23:34:44] error: cannot rebase: You have unstaged changes. [23:34:44] error: Please commit or stash them. [23:34:50] :( [23:35:11] legoktm: I can work it out. I left a mess there [23:35:43] mostly a pile of half done things from the irc network migration [23:35:50] ok, the git diff just says it's changing stuff to be -libera [23:35:50] yeah [23:39:45] jouncebot: now [23:39:46] For the next 0 hour(s) and 20 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210824T2300) [23:39:46] jouncebot: now [23:39:46] For the next 0 hour(s) and 20 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210824T2300) [23:39:52] thanks bd808 :) [23:40:01] yw. thanks for the fix! 
[23:42:03] :D [23:48:59] (03CR) 10Cwhite: [C: 03+1] profile: remove thanos alerts, moved to alerts.git [puppet] - 10https://gerrit.wikimedia.org/r/714541 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [23:50:09] (03CR) 10Cwhite: [C: 03+1] o11y: add alerts ported from icinga/upstream [alerts] - 10https://gerrit.wikimedia.org/r/714543 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [23:52:04] (03PS1) 10BryanDavis: cleanup: commit some changes from libera.chat migration [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/714652 [23:52:44] (03CR) 10BryanDavis: [C: 03+2] cleanup: commit some changes from libera.chat migration [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/714652 (owner: 10BryanDavis) [23:53:14] (03Merged) 10jenkins-bot: cleanup: commit some changes from libera.chat migration [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/714652 (owner: 10BryanDavis) [23:53:22] (03PS2) 10BryanDavis: Rephrase Bot under the Fountain message [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/710028 (owner: 10Lucas Werkmeister (WMDE)) [23:53:27] (03CR) 10BryanDavis: [C: 03+2] Rephrase Bot under the Fountain message [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/710028 (owner: 10Lucas Werkmeister (WMDE)) [23:53:59] (03Merged) 10jenkins-bot: Rephrase Bot under the Fountain message [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/710028 (owner: 10Lucas Werkmeister (WMDE)) [23:54:52] (03PS4) 10BryanDavis: Add a nowandnext command [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/704446 (https://phabricator.wikimedia.org/T286627) (owner: 10Reedy) [23:54:59] (03CR) 10BryanDavis: [C: 03+2] Add a nowandnext command [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/704446 (https://phabricator.wikimedia.org/T286627) (owner: 10Reedy) [23:55:32] (03Merged) 10jenkins-bot: Add a nowandnext command [wikimedia/bots/jouncebot] - 
10https://gerrit.wikimedia.org/r/704446 (https://phabricator.wikimedia.org/T286627) (owner: 10Reedy) [23:59:14] jouncebot: nowandnext [23:59:14] For the next 0 hour(s) and 0 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210824T2300) [23:59:14] In 11 hour(s) and 0 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210825T1100)