[00:00:05] twentyafterfour: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Phabricator update . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211014T0000). [01:22:15] (03CR) 10BryanDavis: [C: 03+2] toolhub: Bump container version to 2021-10-13-231209-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730662 (https://phabricator.wikimedia.org/T293103) (owner: 10BryanDavis) [01:24:29] (03PS3) 10RLazarus: Minimal version of the image catalog [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/723663 (https://phabricator.wikimedia.org/T287130) [01:26:34] (03Merged) 10jenkins-bot: toolhub: Bump container version to 2021-10-13-231209-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730662 (https://phabricator.wikimedia.org/T293103) (owner: 10BryanDavis) [01:28:35] (03PS4) 10RLazarus: Minimal version of the image catalog [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/723663 (https://phabricator.wikimedia.org/T287130) [01:30:42] (03CR) 10RLazarus: "Thanks, PTAL!" [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/723663 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus) [01:31:33] !log bd808@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'toolhub' for release 'main' . [01:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:35:39] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:35:59] !log bd808@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'toolhub' for release 'main' . 
[01:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:41:37] RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:50:55] !log changing user email for "Region of Peel Archives" [01:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:52:06] !log bd808@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'toolhub' for release 'main' . [01:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:05:47] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: generate_os_reports.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:09:35] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:13:05] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:21:41] PROBLEM - BFD status on cr3-knams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:21:55] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:25:04] (03CR) 10MSantos: [C: 04-1] Add script to send tile invalidation events (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/722825 (https://phabricator.wikimedia.org/T270175) (owner: 10Jgiannelos) [03:40:27] (03CR) 10MSantos: [C: 04-1] Add script to send tile invalidation events (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/722825 (https://phabricator.wikimedia.org/T270175) (owner: 10Jgiannelos) [04:05:41] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 
0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:05:57] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:14:01] (03CR) 10Juan90264: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720320 (https://phabricator.wikimedia.org/T289752) (owner: 10Rishabhbhat) [04:29:11] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 2 (graphite1004, ...), Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [04:33:07] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:34:35] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:43:09] RECOVERY - BFD status on cr3-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:43:17] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:47:53] (Primary inbound port utilisation over 80% #page) firing: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [04:47:53] (Primary inbound port utilisation over 80% #page) firing: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [04:48:32] out for a walk but I'll be back at keys soon [04:51:46] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [04:51:46] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [04:54:17] o/ [04:56:00] telia eqiad-codfw link 
is down again and the other one is >80% again [04:56:46] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [04:56:46] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [04:57:39] should we depool codfw then? not sure how worrying ">80%" is [04:57:53] (Primary inbound port utilisation over 80% #page) resolved: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [04:57:53] (Primary inbound port utilisation over 80% #page) resolved: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [04:58:37] I raised the thresholds to 90% [05:00:57] here - still need anything? [05:02:20] we should be good [05:02:36] thanks <3 [05:02:44] Telia maintenance ends in ~5h: Service window end: 2021-10-14 10:00 UTC [05:03:43] Depooling codfw could be an option, as well as routing ulsfo through eqord for example [05:06:24] I'm also wondering if there are optional eqiad/codfw transfers we could hold until the other link is up [05:07:32] we could pool restbase-async in eqiad so it's not going cross-dc to codfw. 
Don't know how much traffic that actually is though [05:10:00] legoktm: most of the traffic will be replication traffic [05:10:14] I am more worried about the docker registry [05:10:31] we might want to pool eqiad instead, but we need to switch the replication of swift [05:10:43] not something I'd do before I get any caffeine [05:30:17] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:48:09] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:48:27] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:08:44] 10SRE, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), 10Patch-For-Review, 10Platform Engineering (Icebox): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10Aklapper) @Seddon: Could you elaborate on these bullet points, please? Thanks! [06:10:05] (03CR) 10Elukey: [C: 03+1] istio: Add wmf-certificates proxyv2 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/730591 (owner: 10JMeybohm) [06:12:31] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: Add rsyslog sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/725892 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto) [06:16:39] (03Merged) 10jenkins-bot: mediawiki: Add rsyslog sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/725892 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto) [06:20:24] 10SRE-swift-storage, 10TimedMediaHandler-Transcode: Intermittent transcode failure 'An unknown error occurred in storage backend "local-swift-codfw".' 
- https://phabricator.wikimedia.org/T201090 (10Aklapper) [06:21:05] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: introduce the common_images data structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/730557 (https://phabricator.wikimedia.org/T291530) (owner: 10Giuseppe Lavagetto) [06:21:15] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: introduce the common_images data structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/730557 (https://phabricator.wikimedia.org/T291530) (owner: 10Giuseppe Lavagetto) [06:21:17] (03PS3) 10Giuseppe Lavagetto: mediawiki: introduce the common_images data structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/730557 (https://phabricator.wikimedia.org/T291530) [06:29:47] (03PS3) 10Juan90264: Create Salima namespace for dagwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730579 (https://phabricator.wikimedia.org/T289911) [06:37:22] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [06:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:54] 10SRE, 10Wikimedia-Mailing-lists, 10I18n: mailman3 encoding issues on unsubscription emails - https://phabricator.wikimedia.org/T290613 (10Aklapper) @MarcoAurelio: Do you know the user's exact email client being used? [07:06:34] (03CR) 10Ayounsi: "I'd recommend:" [puppet] - 10https://gerrit.wikimedia.org/r/730619 (owner: 10Ssingh) [07:14:58] something that didn't occur to me last night but we can do, is put back traffic to swift eqiad to help with link utilization [07:15:03] cc joe XioNoX ^ [07:15:22] godog: oh! yeah I'm sure that will help [07:15:38] +1 [07:15:41] I wasn't aware it was still depooled [07:15:58] thanks! 
[07:16:05] ok I'll put back swift traffic in eqiad cc Emperor [07:16:08] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, one comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/730523 (owner: 10Jbond) [07:16:59] for the record: I'm just pooling eqiad but not depooling codfw [07:17:26] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [07:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:49] godog: ack makes sense [07:18:51] !log filippo@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=swift,name=eqiad [07:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:57] !log filippo@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=swift-ro,name=eqiad [07:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:11] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/730524 (owner: 10Jbond) [07:20:17] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] istio: Add wmf-certificates proxyv2 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/730591 (owner: 10JMeybohm) [07:20:40] but yes eqiad was still depooled while we finished putting the new hw in service, which completed earlier this week [07:22:29] godog: is there a cookbook for that, or were you running conftool by hand there? 
[07:22:50] Emperor: yeah just conftool [07:23:02] Emperor: that's literally one conftool command, but there is also a cookbook IIRC, let me search [07:23:55] Emperor: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/discovery/service-route.py [07:24:31] ta [07:24:51] !log oblivian@cumin1001 START - Cookbook sre.discovery.service-route [07:24:51] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.discovery.service-route (exit_code=99) [07:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:12] Emperor: but amazingly, "check" doesn't work with swift [07:25:30] !log oblivian@cumin1001 START - Cookbook sre.discovery.service-route [07:25:31] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.discovery.service-route (exit_code=99) [07:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:37] !log oblivian@cumin1001 START - Cookbook sre.discovery.service-route [07:25:37] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.discovery.service-route (exit_code=99) [07:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:43] hehe [07:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:58] ok, at least it's relatively easy to fix [07:33:12] (03PS1) 10Daniel Kinzler: Check that the timestamp key/value is set to avoid undefined offset [extensions/Collection] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730580 (https://phabricator.wikimedia.org/T293300) [07:40:17] (03CR) 10Ema: [C: 03+1] "Very nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/730016 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [07:40:37] (03CR) 10Alexandros Kosiaris: admin/otrs: create new root admin group vrts-roots, add Arnold (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/728648 (owner: 10Dzahn) [07:44:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Degraded RAID on backup1002 - https://phabricator.wikimedia.org/T292329 (10jcrespo) ` $ megacli -PDRbld -ShowProg -PhysDrv '[251:4]' -aALL Rebuild Progress on Device at Enclosure 251, Slot 4 Completed 97% in 11 M... [07:45:18] (03PS1) 10Giuseppe Lavagetto: sre.discovery: use CNAME records for swift dns lookup [cookbooks] - 10https://gerrit.wikimedia.org/r/730692 [07:45:31] Emperor: if you're curious ^^ [07:46:52] TY [07:47:53] (03CR) 10jerkins-bot: [V: 04-1] sre.discovery: use CNAME records for swift dns lookup [cookbooks] - 10https://gerrit.wikimedia.org/r/730692 (owner: 10Giuseppe Lavagetto) [07:48:37] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:49:27] ofc [07:50:41] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:51:06] (03CR) 10Filippo Giunchedi: [C: 03+2] graphite: disable tags support [puppet] - 10https://gerrit.wikimedia.org/r/729968 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [07:59:55] looks like the swift traffic took ~2Gbps out of the link [08:07:18] (03PS6) 10Jbond: apt: add a service description for apt to allow DNS discovery [puppet] - 10https://gerrit.wikimedia.org/r/730523 [08:07:28] (03CR) 10Jbond: apt: add a service description for apt to allow DNS discovery (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730523 (owner: 10Jbond) [08:18:50] (03PS2) 10Jbond: systemd::sysuser: also manage the group if we have a uid:gid id [puppet] - 10https://gerrit.wikimedia.org/r/730627 [08:19:01] 
RECOVERY - MegaRAID on backup1002 is OK: OK: optimal, 1 logical, 12 physical https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:22:24] !log rolling out debmonitor-client upgrade to 0.3.1 across the fleet [08:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Degraded RAID on backup1002 - https://phabricator.wikimedia.org/T292329 (10jcrespo) ` RECOVERY - MegaRAID on backup1002 is OK: OK: optimal, 1 logical, 12 ` Thanks, @Jclark-ctr ! [08:40:36] (03PS1) 10Jbond: P:prometheus::node_exporter: update node_exporter to a profile [puppet] - 10https://gerrit.wikimedia.org/r/730700 [08:41:09] (03CR) 10Jbond: [V: 03+1] P:base: move production specific code to their own profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [08:42:01] (03CR) 10Jbond: [C: 03+2] systemd::sysuser: also manage the group if we have a uid:gid id [puppet] - 10https://gerrit.wikimedia.org/r/730627 (owner: 10Jbond) [08:43:23] (03PS8) 10Jbond: gitlab::ssh explicitly add git user with fixed id [puppet] - 10https://gerrit.wikimedia.org/r/728380 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [08:43:53] (03CR) 10Jbond: [C: 03+1] gitlab::ssh explicitly add git user with fixed id (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/728380 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [08:44:51] (03CR) 10Volans: sre.discovery: use CNAME records for swift dns lookup (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/730692 (owner: 10Giuseppe Lavagetto) [08:45:36] (03PS2) 10Jbond: P:prometheus::node_exporter: update node_exporter to a profile [puppet] - 10https://gerrit.wikimedia.org/r/730700 [08:46:39] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31657/console" [puppet] - 
10https://gerrit.wikimedia.org/r/730700 (owner: 10Jbond) [08:48:04] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31658/console" [puppet] - 10https://gerrit.wikimedia.org/r/730700 (owner: 10Jbond) [08:48:08] (03CR) 10Jbond: P:prometheus::node_exporter: update node_exporter to a profile [puppet] - 10https://gerrit.wikimedia.org/r/730700 (owner: 10Jbond) [08:49:02] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Add systemd based watchdog support [software/acme-chief] - 10https://gerrit.wikimedia.org/r/728379 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [08:51:07] !log volans@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: sretest1001.eqiad.wmnet [08:51:07] !log volans@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet [08:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:31] (03PS1) 10Vgutierrez: Release 0.33 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/730703 (https://phabricator.wikimedia.org/T292619) [08:52:14] !log volans@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: sretest1001.eqiad.wmnet [08:52:14] !log volans@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet [08:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:53] (03Merged) 10jenkins-bot: acme_chief: Add systemd based watchdog support [software/acme-chief] - 10https://gerrit.wikimedia.org/r/728379 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [08:54:58] (03PS1) 10Jbond: P:standard: drop unused code/parameters [puppet] - 10https://gerrit.wikimedia.org/r/730704 [08:55:50] (03CR) 
10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31659/console" [puppet] - 10https://gerrit.wikimedia.org/r/730704 (owner: 10Jbond) [08:57:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31660/console" [puppet] - 10https://gerrit.wikimedia.org/r/730704 (owner: 10Jbond) [08:59:03] (03PS1) 10Volans: Release v0.3.1 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/730706 [08:59:23] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:standard: drop unused code/parameters [puppet] - 10https://gerrit.wikimedia.org/r/730704 (owner: 10Jbond) [09:00:06] (03CR) 10Jbond: [C: 03+1] Release v0.3.1 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/730706 (owner: 10Volans) [09:00:35] (03CR) 10Volans: [V: 03+2 C: 03+2] Release v0.3.1 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/730706 (owner: 10Volans) [09:01:57] (03PS40) 10Jbond: P:base: move production specific code to their own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [09:02:19] !log volans@deploy1002 Started deploy [debmonitor/deploy@444b931]: Release v0.3.1 [09:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:41] !log volans@deploy1002 Finished deploy [debmonitor/deploy@444b931]: Release v0.3.1 (duration: 00m 23s) [09:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:41] (03CR) 10Vgutierrez: [C: 03+2] Release 0.33 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/730703 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [09:03:46] !log volans@deploy1002 Started deploy [debmonitor/deploy@444b931]: Release v0.3.1 [09:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:31] !log volans@deploy1002 Finished deploy [debmonitor/deploy@444b931]: Release v0.3.1 
(duration: 00m 45s) [09:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:43] (03PS12) 10Arturo Borrero Gonzalez: cloudbackup: deploy cinder-backup service [puppet] - 10https://gerrit.wikimedia.org/r/728400 (https://phabricator.wikimedia.org/T292546) [09:07:54] (03Merged) 10jenkins-bot: Release 0.33 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/730703 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [09:11:34] (03PS1) 10MMandere: puppetmaster: Add drmrs DC Site [puppet] - 10https://gerrit.wikimedia.org/r/730707 (https://phabricator.wikimedia.org/T282787) [09:13:01] (03PS1) 10Volans: Release v0.3.1 (take 2) [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/730708 [09:13:16] (03PS1) 10Filippo Giunchedi: pontoon: skip certificate generation during bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/730709 [09:13:47] (03CR) 10jerkins-bot: [V: 04-1] pontoon: skip certificate generation during bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/730709 (owner: 10Filippo Giunchedi) [09:14:10] (03PS1) 10MVernon: codfw-prod: more weight to ms-be2045 [software/swift-ring] - 10https://gerrit.wikimedia.org/r/730710 (https://phabricator.wikimedia.org/T290881) [09:14:14] (03CR) 10Jbond: [C: 03+1] Release v0.3.1 (take 2) [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/730708 (owner: 10Volans) [09:14:16] (03PS1) 10Vgutierrez: acme_chief: Add systemd based watchdog support [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/730711 (https://phabricator.wikimedia.org/T292619) [09:14:18] (03PS1) 10Vgutierrez: Release 0.33 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/730712 (https://phabricator.wikimedia.org/T292619) [09:14:20] (03PS1) 10Vgutierrez: debian: Add release 0.33 to the changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/730713 (https://phabricator.wikimedia.org/T292619) [09:15:50] (03CR) 10Volans: [V: 03+2 C: 03+2] 
Release v0.3.1 (take 2) [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/730708 (owner: 10Volans) [09:17:27] !log volans@deploy1002 Started deploy [debmonitor/deploy@ab62ac5]: Release v0.3.1 [09:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:17] !log volans@deploy1002 Finished deploy [debmonitor/deploy@ab62ac5]: Release v0.3.1 (duration: 00m 50s) [09:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:59] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/730707 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [09:19:08] !log volans@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: sretest1001.eqiad.wmnet [09:19:08] !log volans@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet [09:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:37] !log volans@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: sretest1001.eqiad.wmnet [09:19:37] !log volans@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet [09:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:54] !log volans@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: sretest1001.eqiad.wmnet [09:20:54] !log volans@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet [09:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:16] (03PS1) 10Jbond: P:standard: move profile::mail::default_mail_relay 
to profile::base [puppet] - 10https://gerrit.wikimedia.org/r/730714 [09:23:24] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31663/console" [puppet] - 10https://gerrit.wikimedia.org/r/728457 (owner: 10David Caro) [09:23:36] (03CR) 10MMandere: [C: 03+2] puppetmaster: Add drmrs DC Site [puppet] - 10https://gerrit.wikimedia.org/r/730707 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [09:24:06] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Add systemd based watchdog support [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/730711 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [09:24:12] (03CR) 10Vgutierrez: [C: 03+2] Release 0.33 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/730712 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [09:24:16] (03CR) 10Vgutierrez: [C: 03+2] debian: Add release 0.33 to the changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/730713 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [09:24:37] PROBLEM - Check systemd state on search-loader2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_mjolnir-kafka-bulk-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:24:41] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [09:25:51] (03CR) 10David Caro: [V: 03+1 C: 03+2] base::sysctl::core_dumps: move core_dumps to their own class [puppet] - 10https://gerrit.wikimedia.org/r/728457 (owner: 10David Caro) 
[09:25:59] (03PS5) 10David Caro: base::sysctl::core_dumps: move core_dumps to their own class [puppet] - 10https://gerrit.wikimedia.org/r/728457 [09:26:47] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [09:27:26] (03Merged) 10jenkins-bot: acme_chief: Add systemd based watchdog support [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/730711 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [09:27:31] (03PS2) 10Jbond: P:standard: move profile::mail::default_mail_relay to profile::base [puppet] - 10https://gerrit.wikimedia.org/r/730714 [09:27:43] (03Merged) 10jenkins-bot: Release 0.33 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/730712 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [09:27:45] (03Merged) 10jenkins-bot: debian: Add release 0.33 to the changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/730713 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [09:28:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31665/console" [puppet] - 10https://gerrit.wikimedia.org/r/730714 (owner: 10Jbond) [09:28:52] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31664/console" [puppet] - 10https://gerrit.wikimedia.org/r/730714 (owner: 10Jbond) [09:29:39] (03PS3) 10Jbond: P:standard: move profile::mail::default_mail_relay to profile::base [puppet] - 10https://gerrit.wikimedia.org/r/730714 [09:30:12] (03PS4) 10Jbond: P:standard: move profile::mail::default_mail_relay to 
profile::base [puppet] - 10https://gerrit.wikimedia.org/r/730714 [09:30:41] (03PS1) 10Muehlenhoff: Sync more content of Hiera contact information and owners.yaml [puppet] - 10https://gerrit.wikimedia.org/r/730716 [09:32:32] (03CR) 10Muehlenhoff: [C: 03+2] Sync more content of Hiera contact information and owners.yaml [puppet] - 10https://gerrit.wikimedia.org/r/730716 (owner: 10Muehlenhoff) [09:33:14] (03PS2) 10Filippo Giunchedi: pontoon: skip certificate generation during bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/730709 [09:34:18] it is storming incredibly here. if I disappear it will be the power. [09:34:27] PROBLEM - BFD status on cr3-knams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:38:53] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:39:05] (03PS1) 10MMandere: acme_chief: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/730717 (https://phabricator.wikimedia.org/T282787) [09:39:59] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:40:19] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: skip certificate generation during bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/730709 (owner: 10Filippo Giunchedi) [09:40:21] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:44:27] (03CR) 10David Caro: P:prometheus::node_exporter: update node_exporter to a profile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/730700 (owner: 10Jbond) [09:49:52] (03PS11) 10Jgiannelos: Add script to send tile invalidation events [puppet] - 10https://gerrit.wikimedia.org/r/722825 (https://phabricator.wikimedia.org/T270175) [09:53:38] (03PS3) 10Jbond: P:prometheus::node_exporter: update node_exporter to a profile 
[puppet] - 10https://gerrit.wikimedia.org/r/730700 [09:54:35] (03PS1) 10Urbanecm: UncachedMenteeOverviewDataProvider: Reset state before calculating [extensions/GrowthExperiments] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730583 (https://phabricator.wikimedia.org/T290609) [09:54:46] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! I think we're also ok to go 4000 here, I'm +1'ing because either is fine" [software/swift-ring] - 10https://gerrit.wikimedia.org/r/730710 (https://phabricator.wikimedia.org/T290881) (owner: 10MVernon) [09:54:49] (03PS1) 10Urbanecm: UncachedMenteeOverviewDataProvider: Reset state before calculating [extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730584 (https://phabricator.wikimedia.org/T290609) [09:55:01] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:55:04] (03PS1) 10Urbanecm: updateMenteeData: Summarize profiling data for all mentors [extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730585 (https://phabricator.wikimedia.org/T290609) [09:55:12] (03PS1) 10Urbanecm: updateMenteeData: Summarize profiling data for all mentors [extensions/GrowthExperiments] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730726 (https://phabricator.wikimedia.org/T290609) [09:55:15] (03CR) 10Jbond: "thanks see inline" [puppet] - 10https://gerrit.wikimedia.org/r/730700 (owner: 10Jbond) [09:55:36] (03CR) 10Vgutierrez: [C: 03+1] acme_chief: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/730717 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [09:56:23] (03PS2) 10Urbanecm: updateMenteeData: Summarize profiling data for all mentors [extensions/GrowthExperiments] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730726 (https://phabricator.wikimedia.org/T290609) [09:57:11] (03PS1) 10Urbanecm: updateMenteeData: Switch profiling to microsecond precision 
[extensions/GrowthExperiments] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730718 (https://phabricator.wikimedia.org/T290609) [09:57:19] jouncebot: nowandnext [09:57:19] No deployments scheduled for the next 0 hour(s) and 2 minute(s) [09:57:19] In 0 hour(s) and 2 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211014T1000) [09:57:55] (03CR) 10Urbanecm: [C: 03+2] UncachedMenteeOverviewDataProvider: Reset state before calculating [extensions/GrowthExperiments] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730583 (https://phabricator.wikimedia.org/T290609) (owner: 10Urbanecm) [09:57:59] (03CR) 10Urbanecm: [C: 03+2] updateMenteeData: Summarize profiling data for all mentors [extensions/GrowthExperiments] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730726 (https://phabricator.wikimedia.org/T290609) (owner: 10Urbanecm) [09:58:02] (03CR) 10Urbanecm: [C: 03+2] updateMenteeData: Switch profiling to microsecond precision [extensions/GrowthExperiments] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730718 (https://phabricator.wikimedia.org/T290609) (owner: 10Urbanecm) [09:58:28] (03PS2) 10Urbanecm: updateMenteeData: Summarize profiling data for all mentors [extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730585 (https://phabricator.wikimedia.org/T290609) [09:58:31] (03CR) 10Urbanecm: [C: 03+2] UncachedMenteeOverviewDataProvider: Reset state before calculating [extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730584 (https://phabricator.wikimedia.org/T290609) (owner: 10Urbanecm) [09:58:34] (03CR) 10Urbanecm: [C: 03+2] updateMenteeData: Summarize profiling data for all mentors [extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730585 (https://phabricator.wikimedia.org/T290609) (owner: 10Urbanecm) [09:59:30] (03CR) 10David Caro: P:prometheus::node_exporter: update node_exporter to a 
profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730700 (owner: 10Jbond) [09:59:33] (03PS1) 10Urbanecm: updateMenteeData: Switch profiling to microsecond precision [extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730720 (https://phabricator.wikimedia.org/T290609) [09:59:54] (03PS2) 10Urbanecm: updateMenteeData: Switch profiling to microsecond precision [extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730720 (https://phabricator.wikimedia.org/T290609) [10:00:00] (03CR) 10Urbanecm: [C: 03+2] updateMenteeData: Switch profiling to microsecond precision [extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730720 (https://phabricator.wikimedia.org/T290609) (owner: 10Urbanecm) [10:00:04] mvolz: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211014T1000). [10:05:06] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, modulo the points already made" [puppet] - 10https://gerrit.wikimedia.org/r/730700 (owner: 10Jbond) [10:05:08] (03PS12) 10Jgiannelos: Add script to send tile invalidation events [puppet] - 10https://gerrit.wikimedia.org/r/722825 (https://phabricator.wikimedia.org/T270175) [10:06:36] (03CR) 10Jgiannelos: Add script to send tile invalidation events (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/722825 (https://phabricator.wikimedia.org/T270175) (owner: 10Jgiannelos) [10:07:48] (03CR) 10David Caro: P:prometheus::node_exporter: update node_exporter to a profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730700 (owner: 10Jbond) [10:10:05] (03CR) 10David Caro: P:prometheus::node_exporter: update node_exporter to a profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730700 (owner: 10Jbond) [10:10:29] (03CR) 10MVernon: [V: 03+2 C: 03+2] codfw-prod: more weight to ms-be2045 (031 comment) 
[software/swift-ring] - 10https://gerrit.wikimedia.org/r/730710 (https://phabricator.wikimedia.org/T290881) (owner: 10MVernon) [10:11:33] (03PS1) 10Elukey: Add the KServe images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/730722 (https://phabricator.wikimedia.org/T293331) [10:11:55] (03CR) 10jerkins-bot: [V: 04-1] updateMenteeData: Summarize profiling data for all mentors [extensions/GrowthExperiments] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730726 (https://phabricator.wikimedia.org/T290609) (owner: 10Urbanecm) [10:12:01] :( [10:14:09] what a jerk this bot is [10:14:16] (03CR) 10Filippo Giunchedi: [C: 03+1] P:prometheus::node_exporter: update node_exporter to a profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730700 (owner: 10Jbond) [10:14:16] the jerkiest [10:15:15] (03CR) 10MMandere: [C: 03+2] acme_chief: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/730717 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [10:15:37] (03CR) 10Ema: [C: 03+1] "Very useful! One nit about the check otherwise +1." 
[puppet] - 10https://gerrit.wikimedia.org/r/730523 (owner: 10Jbond) [10:19:57] (03PS1) 10Jcrespo: mediabackups: Backup frwiki media on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/730723 (https://phabricator.wikimedia.org/T262668) [10:25:14] (03PS13) 10Arturo Borrero Gonzalez: cloudbackup: deploy cinder-backup service [puppet] - 10https://gerrit.wikimedia.org/r/728400 (https://phabricator.wikimedia.org/T292546) [10:27:59] (03CR) 10Ema: apt: add a service description for apt to allow DNS discovery (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/730524 (owner: 10Jbond) [10:28:04] (03PS2) 10Jcrespo: mediabackups: Backup frwiki media on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/730723 (https://phabricator.wikimedia.org/T262668) [10:28:52] (03CR) 10Jcrespo: [C: 03+2] mediabackups: Backup frwiki media on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/730723 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [10:29:36] (03PS1) 10MMandere: bastionhost: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/730724 (https://phabricator.wikimedia.org/T282787) [10:29:50] (03CR) 10Urbanecm: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/730243 (https://phabricator.wikimedia.org/T279436) (owner: 10Dylsss) [10:30:10] (03Merged) 10jenkins-bot: UncachedMenteeOverviewDataProvider: Reset state before calculating [extensions/GrowthExperiments] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730583 (https://phabricator.wikimedia.org/T290609) (owner: 10Urbanecm) [10:30:13] (03Merged) 10jenkins-bot: updateMenteeData: Summarize profiling data for all mentors [extensions/GrowthExperiments] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730726 (https://phabricator.wikimedia.org/T290609) (owner: 10Urbanecm) [10:30:15] (03Merged) 10jenkins-bot: updateMenteeData: Switch profiling to microsecond precision [extensions/GrowthExperiments] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730718 
(https://phabricator.wikimedia.org/T290609) (owner: 10Urbanecm) [10:30:18] (03Merged) 10jenkins-bot: UncachedMenteeOverviewDataProvider: Reset state before calculating [extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730584 (https://phabricator.wikimedia.org/T290609) (owner: 10Urbanecm) [10:30:21] (03Merged) 10jenkins-bot: updateMenteeData: Summarize profiling data for all mentors [extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730585 (https://phabricator.wikimedia.org/T290609) (owner: 10Urbanecm) [10:30:24] (03Merged) 10jenkins-bot: updateMenteeData: Switch profiling to microsecond precision [extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730720 (https://phabricator.wikimedia.org/T290609) (owner: 10Urbanecm) [10:30:25] finally [10:30:39] (03CR) 10jerkins-bot: [V: 04-1] Dumps: Clarify licensing for Wikidata and update various links [puppet] - 10https://gerrit.wikimedia.org/r/730243 (https://phabricator.wikimedia.org/T279436) (owner: 10Dylsss) [10:32:05] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts testvm2001.codfw.wmnet [10:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:21] (03PS14) 10Arturo Borrero Gonzalez: cloudbackup: deploy cinder-backup service [puppet] - 10https://gerrit.wikimedia.org/r/728400 (https://phabricator.wikimedia.org/T292546) [10:33:24] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.3/extensions/GrowthExperiments/: 465b564, a8cc98b, 6e95c48: GrowthExperiments backports (T290609) (duration: 01m 06s) [10:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:30] T290609: Make mentee overview module's updateMenteeData.php scale better - https://phabricator.wikimedia.org/T290609 [10:33:59] 10SRE, 10ops-drmrs, 10DNS, 10Infrastructure-Foundations, and 2 others: setup drmrs mgmt & private prefixs - question on switch status - 
https://phabricator.wikimedia.org/T293294 (10cmooney) > I'm thinking 10.136.0.0/16 for the site seems logical, with 10.136.128.0/17 for mgmt. Keep Vlan 900 for that also I... [10:35:16] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.4/extensions/GrowthExperiments/: 1f33fc3, e0ea1b8, cba2ac9: GrowthExperiments backports (T290609) (duration: 01m 05s) [10:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts testvm2001.codfw.wmnet [10:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:19] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Create Ganeti test cluster - https://phabricator.wikimedia.org/T286206 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `testvm2001.codfw.wmnet` - testvm2001.codfw.wmnet (**PASS**) - Downtimed host on Icing... [10:38:33] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts testvm2002.codfw.wmnet [10:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:54] (03PS4) 10Jcrespo: updatementeedata.pp: Update script parameters [puppet] - 10https://gerrit.wikimedia.org/r/728656 (https://phabricator.wikimedia.org/T290609) (owner: 10Urbanecm) [10:40:24] (03PS15) 10Arturo Borrero Gonzalez: cloudbackup: deploy cinder-backup service [puppet] - 10https://gerrit.wikimedia.org/r/728400 (https://phabricator.wikimedia.org/T292546) [10:42:47] (03CR) 10Jcrespo: [C: 03+2] updatementeedata.pp: Update script parameters [puppet] - 10https://gerrit.wikimedia.org/r/728656 (https://phabricator.wikimedia.org/T290609) (owner: 10Urbanecm) [10:49:22] (03PS16) 10Arturo Borrero Gonzalez: cloudbackup: deploy cinder-backup service [puppet] - 10https://gerrit.wikimedia.org/r/728400 (https://phabricator.wikimedia.org/T292546) [10:51:09] (03PS17) 10Arturo Borrero Gonzalez: cloudbackup: deploy 
cinder-backup service [puppet] - 10https://gerrit.wikimedia.org/r/728400 (https://phabricator.wikimedia.org/T292546) [10:52:15] (03PS1) 10Lucas Werkmeister (WMDE): Set wmgWikibaseDispatchViaJobsPruneChangesTableInJobEnabled for wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730725 (https://phabricator.wikimedia.org/T291828) [10:52:17] (03PS1) 10Lucas Werkmeister (WMDE): Untangle “dispatch via jobs” settings in Wikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730746 (https://phabricator.wikimedia.org/T291828) [10:52:19] (03PS1) 10Lucas Werkmeister (WMDE): Set dispatchViaJobsAllowedClients to null everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730747 (https://phabricator.wikimedia.org/T291828) [10:52:21] (03PS1) 10Lucas Werkmeister (WMDE): Remove $wmgWikibaseDispatchViaJobsAllowedClients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730748 (https://phabricator.wikimedia.org/T291828) [10:52:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts testvm2002.codfw.wmnet [10:52:28] ^ we gotta have something to do during the training, don’t we ;) [10:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:38] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Create Ganeti test cluster - https://phabricator.wikimedia.org/T286206 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `testvm2002.codfw.wmnet` - testvm2002.codfw.wmnet (**PASS**) - Downtimed host on Icing... 
[10:52:39] (none of those are urgent, let’s see how the training goes) [10:54:12] (03PS1) 10Vgutierrez: acme_chief: auto-detect systemd watchdog [software/acme-chief] - 10https://gerrit.wikimedia.org/r/730749 (https://phabricator.wikimedia.org/T292619) [10:54:35] (03CR) 10Michael Große: [C: 03+1] "Could also include commons wiki in principle, but since that does not have any changes in wb_changes, this is fine as well" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730725 (https://phabricator.wikimedia.org/T291828) (owner: 10Lucas Werkmeister (WMDE)) [10:54:56] (03PS4) 10Jbond: P:prometheus::node_exporter: update node_exporter to a profile [puppet] - 10https://gerrit.wikimedia.org/r/730700 [10:54:58] (03PS1) 10Jbond: P:prometheus::node_exporter: update node_exporter to a profile [puppet] - 10https://gerrit.wikimedia.org/r/730750 [10:55:23] (03CR) 10Jbond: [C: 03+2] "updated i will just merge this one as its a no-op" [puppet] - 10https://gerrit.wikimedia.org/r/730700 (owner: 10Jbond) [10:55:26] lol [10:55:45] (03CR) 10Michael Große: [C: 03+1] "thank you for cleaning all that up!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/730746 (https://phabricator.wikimedia.org/T291828) (owner: 10Lucas Werkmeister (WMDE)) [10:56:25] (03CR) 10jerkins-bot: [V: 04-1] P:prometheus::node_exporter: update node_exporter to a profile [puppet] - 10https://gerrit.wikimedia.org/r/730700 (owner: 10Jbond) [10:56:28] we have one person signed up for training, no patches in the window, the procedure can be talked through and if the trainee has access rights to the hosts they can at least run git status and look at the various directories [10:56:39] (03CR) 10Michael Große: [C: 03+1] Set dispatchViaJobsAllowedClients to null everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730747 (https://phabricator.wikimedia.org/T291828) (owner: 10Lucas Werkmeister (WMDE)) [10:56:56] (03CR) 10Michael Große: [C: 03+1] Remove $wmgWikibaseDispatchViaJobsAllowedClients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730748 (https://phabricator.wikimedia.org/T291828) (owner: 10Lucas Werkmeister (WMDE)) [10:56:57] there's lots of ground to cover, but as you say let's see how it goes, I don't know what access to the cluster they will have [10:57:28] warning that we have a really powerful storm here which could knock out power, it's going to last through tomorrow (!). 
so if I abruptly disappear, that will be the reason [10:57:49] (03PS5) 10Jbond: P:prometheus::node_exporter: update node_exporter to a profile [puppet] - 10https://gerrit.wikimedia.org/r/730700 [10:57:49] ok, best of luck with that 😬 [10:57:51] (03PS18) 10Arturo Borrero Gonzalez: cloudbackup: deploy cinder-backup service [puppet] - 10https://gerrit.wikimedia.org/r/728400 (https://phabricator.wikimedia.org/T292546) [10:57:55] (03CR) 10Jbond: [C: 03+2] P:prometheus::node_exporter: update node_exporter to a profile [puppet] - 10https://gerrit.wikimedia.org/r/730700 (owner: 10Jbond) [10:58:22] thanks, there's some low level flooding in other parts of the city but we're elevated so we won't see that [10:58:59] correction: there are now 4 patches in the window :-D :-D [10:59:11] :D [10:59:31] tbh they’re not great patches for training, something simple in IS.php would be better [10:59:50] I can also deploy them later in the afternoon [10:59:57] (03PS1) 10Arturo Borrero Gonzalez: hieradata: add profile::openstack::base::cinder::db_pass placeholder [labs/private] - 10https://gerrit.wikimedia.org/r/730751 (https://phabricator.wikimedia.org/T292546) [11:00:04] Amir1, Lucas_WMDE, and apergos: How many deployers does it take to do UTC morning backport and config training deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211014T1100). [11:00:04] Lucas_WMDE: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:11] o/ [11:00:38] is there a call I should join? 
[11:00:51] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] hieradata: add profile::openstack::base::cinder::db_pass placeholder [labs/private] - 10https://gerrit.wikimedia.org/r/730751 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [11:01:07] (I guess these changes could be a good way to teach “if you don’t feel comfortable with it, don’t deploy it” ;) ) [11:01:27] (03PS1) 10Urbanecm: growthexperiments: Run refreshLinkRecommendations in parallel [puppet] - 10https://gerrit.wikimedia.org/r/730752 (https://phabricator.wikimedia.org/T278103) [11:01:57] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:59] (03CR) 10Jbond: [C: 03+2] P:prometheus::node_exporter: update node_exporter to a profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730700 (owner: 10Jbond) [11:03:25] (03PS2) 10Jbond: P:prometheus::node_exporter: update node_exporter to a profile [puppet] - 10https://gerrit.wikimedia.org/r/730750 [11:04:04] (03PS2) 10Urbanecm: growthexperiments: Run refreshLinkRecommendations in parallel [puppet] - 10https://gerrit.wikimedia.org/r/730752 (https://phabricator.wikimedia.org/T278103) [11:04:29] (03CR) 10Urbanecm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/730752 (https://phabricator.wikimedia.org/T278103) (owner: 10Urbanecm) [11:04:51] (03CR) 10Jbond: P:prometheus::node_exporter: update node_exporter to a profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730700 (owner: 10Jbond) [11:04:53] (03CR) 10jerkins-bot: [V: 04-1] P:prometheus::node_exporter: update node_exporter to a profile [puppet] - 10https://gerrit.wikimedia.org/r/730750 (owner: 10Jbond) [11:05:02] (03CR) 10David Caro: [C: 03+1] "Wait for the next patch though, we have to tweak the hiera on horizon before that goes on." 
[puppet] - 10https://gerrit.wikimedia.org/r/730700 (owner: 10Jbond) [11:05:10] (03PS1) 10Arturo Borrero Gonzalez: hieradata: add placeholder for profile::openstack::base::nova::rabbit_pass [labs/private] - 10https://gerrit.wikimedia.org/r/730753 (https://phabricator.wikimedia.org/T292546) [11:05:23] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] hieradata: add placeholder for profile::openstack::base::nova::rabbit_pass [labs/private] - 10https://gerrit.wikimedia.org/r/730753 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [11:05:48] if any other deployer wants to sit in on the training, feel free to join the google meet [11:06:30] I don’t think I have the google meet yet [11:09:12] (otherwise I’d be interested in joining, sure) [11:09:30] the link, Lucas_WMDE, is on the google calendar entry for the backport window [11:10:12] our trainee does not have deployment rights yet so this will be a descriptive talk through rather than a walk through. [11:10:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:46] ohh, now I found it in the calendar entry [11:10:48] thanks [11:10:58] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts testvm2006.codfw.wmnet [11:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:29] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:14:43] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:14:45] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:14:47] (03PS19) 10Arturo Borrero Gonzalez: cloudbackup: deploy cinder-backup service [puppet] - 
10https://gerrit.wikimedia.org/r/728400 (https://phabricator.wikimedia.org/T292546) [11:15:05] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:16:22] (03PS1) 10Elukey: Add the kserve and kserve-inference charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/730761 (https://phabricator.wikimedia.org/T293331) [11:16:24] (03PS1) 10Elukey: Move the ml-serve cluster to KServe [deployment-charts] - 10https://gerrit.wikimedia.org/r/730762 (https://phabricator.wikimedia.org/T293331) [11:17:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts testvm2006.codfw.wmnet [11:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:05] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Create Ganeti test cluster - https://phabricator.wikimedia.org/T286206 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `testvm2006.codfw.wmnet` - testvm2006.codfw.wmnet (**WARN**) - //Host not found on Ici... 
[11:17:38] (03PS20) 10Arturo Borrero Gonzalez: cloudbackup: deploy cinder-backup service [puppet] - 10https://gerrit.wikimedia.org/r/728400 (https://phabricator.wikimedia.org/T292546) [11:17:57] (03CR) 10jerkins-bot: [V: 04-1] Move the ml-serve cluster to KServe [deployment-charts] - 10https://gerrit.wikimedia.org/r/730762 (https://phabricator.wikimedia.org/T293331) (owner: 10Elukey) [11:21:54] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1001/31676/" [puppet] - 10https://gerrit.wikimedia.org/r/728400 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [11:22:15] RECOVERY - BFD status on cr3-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:24:38] (03CR) 10Jbond: [C: 03+2] P:standard: move profile::mail::default_mail_relay to profile::base [puppet] - 10https://gerrit.wikimedia.org/r/730714 (owner: 10Jbond) [11:27:30] (03PS1) 10Volans: tox: add support for Python 3.9 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/730766 [11:27:33] (03PS1) 10Volans: scripts: add support for new VLAN names [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/730767 [11:31:35] (03PS1) 10Arturo Borrero Gonzalez: openstack: cinder backups: use per-deployment DB pass [puppet] - 10https://gerrit.wikimedia.org/r/730768 (https://phabricator.wikimedia.org/T292546) [11:34:14] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1002/31681/" [puppet] - 10https://gerrit.wikimedia.org/r/730768 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [11:35:15] (03CR) 10Jbond: P:prometheus::node_exporter: update node_exporter to a profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730700 (owner: 10Jbond) [11:37:24]  [11:37:33] (03PS1) 10Arturo Borrero Gonzalez: openstack: cinder backups: use per-deployment rabbit pass Each deployment has a different rabbit pass. 
We should read them from hiera accordingly. Bug: T292546 Signed-off-by: Arturo Borrero Gonzalez Change-Id: I5014ac48caf88b3d8d576a0db2e7f56ed8f6a3d5 [puppet] - 10https://gerrit.wikimedia.org/r/730769 (https://phabricator.wikimedia.org/T292546) [11:38:02] (03CR) 10jerkins-bot: [V: 04-1] openstack: cinder backups: use per-deployment rabbit pass Each deployment has a different rabbit pass. We should read them from hiera accordingly. Bug: T292546 Signed-off-by: Arturo Borrero Gonzalez Change-Id: I5014ac48caf88b3d8d576a0db2e7f56ed8f6a3d5 [puppet] - 10https://gerrit.wikimedia.org/r/730769 (https://phabricator.wikimedia.org/T292546) (owner: 1 [11:39:08] (03CR) 10Lucas Werkmeister (WMDE): "recheck to make something appear in Zuul" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730725 (https://phabricator.wikimedia.org/T291828) (owner: 10Lucas Werkmeister (WMDE)) [11:39:18] (03PS2) 10Arturo Borrero Gonzalez: openstack: cinder backups: use per-deployment rabbit pass [puppet] - 10https://gerrit.wikimedia.org/r/730769 (https://phabricator.wikimedia.org/T292546) [11:39:53] (03CR) 10jerkins-bot: [V: 04-1] openstack: cinder backups: use per-deployment rabbit pass [puppet] - 10https://gerrit.wikimedia.org/r/730769 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [11:39:55] (03PS7) 10Jbond: apt: add a service description for apt to allow DNS discovery [puppet] - 10https://gerrit.wikimedia.org/r/730523 [11:40:52] (03PS3) 10Arturo Borrero Gonzalez: openstack: cinder backups: use per-deployment rabbit pass [puppet] - 10https://gerrit.wikimedia.org/r/730769 (https://phabricator.wikimedia.org/T292546) [11:43:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1003/31682/" [puppet] - 10https://gerrit.wikimedia.org/r/730769 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [11:44:54] (03CR) 10Jbond: apt: add a service description for apt to allow DNS discovery (031 comment) 
[puppet] - 10https://gerrit.wikimedia.org/r/730523 (owner: 10Jbond) [11:45:57] (03CR) 10Filippo Giunchedi: P:prometheus::node_exporter: update node_exporter to a profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730700 (owner: 10Jbond) [11:46:49] PROBLEM - ganeti-mond running on ganeti2026 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [11:47:21] ^ ganeti2026 is expected, silencing [11:47:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2026.codfw.wmnet with OS buster [11:47:49] PROBLEM - ganeti-noded running on ganeti2026 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [11:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:54] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Create Ganeti test cluster - https://phabricator.wikimedia.org/T286206 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2026.codfw.wmnet with OS buster [11:49:03] (03PS9) 10Jelto: gitlab::ssh explicitly add git user with fixed id [puppet] - 10https://gerrit.wikimedia.org/r/728380 (https://phabricator.wikimedia.org/T283076) [11:51:06] (03PS5) 10Jbond: apt: add a service description for apt to allow DNS discovery [dns] - 10https://gerrit.wikimedia.org/r/730524 [11:51:43] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31683/console" [puppet] - 10https://gerrit.wikimedia.org/r/728380 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [11:53:09] (03CR) 10Jbond: apt: add a service description for apt to allow DNS discovery (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/730524 (owner: 10Jbond) [11:54:10] (03PS1) 10Arturo Borrero Gonzalez: hieradata: cloudbackup2002: fix typo in LVM volue group name 
[puppet] - 10https://gerrit.wikimedia.org/r/730771 (https://phabricator.wikimedia.org/T292546) [11:54:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hieradata: cloudbackup2002: fix typo in LVM volue group name [puppet] - 10https://gerrit.wikimedia.org/r/730771 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [11:55:56] (03CR) 10Jbond: apt: add a service description for apt to allow DNS discovery (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/730524 (owner: 10Jbond) [11:56:38] (03CR) 10Jbond: P:prometheus::node_exporter: update node_exporter to a profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730700 (owner: 10Jbond) [11:57:16] (03CR) 10Jbond: [C: 03+1] gitlab::ssh explicitly add git user with fixed id [puppet] - 10https://gerrit.wikimedia.org/r/728380 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [11:58:13] ok, I realize the window is almost over Lucas, so not sure what you want to do about the config changes [11:58:22] I’ll just leave them around for now [11:58:24] heh [11:58:35] the one that's merged ought to go through though [11:58:36] or maybe deploy them over the next 1½ hours, there’s a meeting I only need to pay half attention to I think [11:58:46] we didn’t merge anything, did we? 
[11:58:52] I only rechecked one to make it show up in Zuul [11:58:55] but didn’t +2 it [11:58:57] ah recheck [11:59:00] perfect [11:59:06] (03CR) 10Jelto: [V: 03+1] gitlab::ssh explicitly add git user with fixed id (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/728380 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [11:59:21] in that case, at your leisure, thanks for the assist in the training, that was great [11:59:34] :) [11:59:38] jouncebot: nowandnext [11:59:38] For the next 0 hour(s) and 0 minute(s): UTC morning backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211014T1100) [11:59:38] In 4 hour(s) and 0 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211014T1600) [11:59:46] yeah, nice big break for self-servicing ^^ [11:59:50] good! [12:00:04] I like the 0 hours and 0 minutes notification. come on bot, you can add better than that [12:01:27] (03PS41) 10Jbond: P:base: move production specific code to their own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [12:02:07] (03CR) 10jerkins-bot: [V: 04-1] P:base: move production specific code to their own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [12:03:33] (03PS42) 10Jbond: P:base: move production specific code to their own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [12:03:51] (03PS43) 10Jbond: P:base: move production specific code to their own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [12:04:24] (03CR) 10jerkins-bot: [V: 04-1] P:base: move production specific code to their own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [12:06:58] (03CR) 10Kosta Harlan: [C: 03+1] "Neat. 
Thanks! We'll have to keep an eye on the service but it should be able to handle the parallelization of the requests. Maybe we shoul" [puppet] - 10https://gerrit.wikimedia.org/r/730752 (https://phabricator.wikimedia.org/T278103) (owner: 10Urbanecm) [12:07:51] if nobody else is deploying right now, I’ll self-service the first of those four Wikibase config changes in the UTC morning window [12:08:04] should be a complete no-op, I’ll verify that via shell.php on mwdebug [12:08:30] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Set wmgWikibaseDispatchViaJobsPruneChangesTableInJobEnabled for wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730725 (https://phabricator.wikimedia.org/T291828) (owner: 10Lucas Werkmeister (WMDE)) [12:09:56] (03CR) 10Urbanecm: growthexperiments: Run refreshLinkRecommendations in parallel (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730752 (https://phabricator.wikimedia.org/T278103) (owner: 10Urbanecm) [12:10:06] (03Merged) 10jenkins-bot: Set wmgWikibaseDispatchViaJobsPruneChangesTableInJobEnabled for wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730725 (https://phabricator.wikimedia.org/T291828) (owner: 10Lucas Werkmeister (WMDE)) [12:11:02] looks good, syncing [12:11:21] (03CR) 10Urbanecm: growthexperiments: Run refreshLinkRecommendations in parallel (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730752 (https://phabricator.wikimedia.org/T278103) (owner: 10Urbanecm) [12:12:10] in case anyone’s looking at logspam-watch, those psysh errors are from me, I’m running several shell.php to test those changes [12:12:21] see https://phabricator.wikimedia.org/T248802 [12:12:34] harmless (but currently #3 in the list, hence mentioning) [12:12:47] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:730725|Set wmgWikibaseDispatchViaJobsPruneChangesTableInJobEnabled for wikidatawiki (T291828)]] (no-op) (duration: 01m 05s) [12:12:55] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:56] T291828: Remove transitionary Dispatch Config - https://phabricator.wikimedia.org/T291828 [12:13:56] (03PS1) 10Arturo Borrero Gonzalez: openstack: cinder backups: create directory for mount [puppet] - 10https://gerrit.wikimedia.org/r/730776 (https://phabricator.wikimedia.org/T292546) [12:15:41] next change should also be a no-op [12:15:51] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Untangle “dispatch via jobs” settings in Wikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730746 (https://phabricator.wikimedia.org/T291828) (owner: 10Lucas Werkmeister (WMDE)) [12:16:27] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1001/31684/" [puppet] - 10https://gerrit.wikimedia.org/r/730776 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [12:16:53] (03Merged) 10jenkins-bot: Untangle “dispatch via jobs” settings in Wikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730746 (https://phabricator.wikimedia.org/T291828) (owner: 10Lucas Werkmeister (WMDE)) [12:17:22] testing on mwdebug1002 again [12:17:26] (expect more psysh errors) [12:17:47] looks fine, syncing [12:18:00] 10SRE, 10ops-drmrs, 10DNS, 10Infrastructure-Foundations, and 2 others: setup drmrs mgmt & private prefixs - question on switch status - https://phabricator.wikimedia.org/T293294 (10MMandere) Thank you @cmooney for the inclusion. The assignments are fine by me. I will, however, need to further consult wi... [12:18:27] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:19:04] two down two to go? 
[12:19:15] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:730746|Untangle “dispatch via jobs” settings in Wikibase.php (T291828)]] (no-op) (duration: 01m 04s) [12:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:22] T291828: Remove transitionary Dispatch Config - https://phabricator.wikimedia.org/T291828 [12:19:30] yeah but I might wait with the other two, not sure [12:19:45] the first two just shuffled things around with no effective change to the globals at the end [12:19:49] the next two actually change something ^^ [12:19:58] even though it shouldn’t affect much [12:20:05] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:20:25] if anyone else wants to deploy in the meantime, feel free to, I’ll say something again if I decide to proceed [12:20:41] (03CR) 10Ayounsi: [C: 03+1] tox: add support for Python 3.9 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/730766 (owner: 10Volans) [12:24:01] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:25:32] (brb restarting my IRC client) [12:26:14] (03CR) 10Volans: [C: 03+2] tox: add support for Python 3.9 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/730766 (owner: 10Volans) [12:26:54] (03Merged) 10jenkins-bot: tox: add support for Python 3.9 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/730766 (owner: 10Volans) [12:30:28] 10SRE, 10ops-drmrs, 10DNS, 10Infrastructure-Foundations, and 2 others: setup drmrs mgmt & private prefixs - question on switch status - https://phabricator.wikimedia.org/T293294 (10BBlack) >>! 
In T293294#7427708, @cmooney wrote: > We've two private subnets assigned, one for each rack/switch: > > https://n... [12:34:44] (03PS1) 10Arturo Borrero Gonzalez: openstack: cinder: allow backup API actions [puppet] - 10https://gerrit.wikimedia.org/r/730779 (https://phabricator.wikimedia.org/T292546) [12:35:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: cinder: allow backup API actions [puppet] - 10https://gerrit.wikimedia.org/r/730779 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [12:35:45] for the uncommitted changes I'm looking at the script as there might be an issue with new data for drmrs DC [12:37:57] (03PS1) 10Jbond: minor updates [puppet] - 10https://gerrit.wikimedia.org/r/730780 [12:38:23] (03PS2) 10Jbond: admin/utils: minor updates [puppet] - 10https://gerrit.wikimedia.org/r/730780 [12:39:24] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add the KServe images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/730722 (https://phabricator.wikimedia.org/T293331) (owner: 10Elukey) [12:42:12] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 10 days, 0:00:00 on cloudbackup2002.codfw.wmnet with reason: working on cinder backupse [12:42:14] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on cloudbackup2002.codfw.wmnet with reason: working on cinder backupse [12:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:04] (03CR) 10Jbond: [C: 03+2] admin/utils: minor updates [puppet] - 10https://gerrit.wikimedia.org/r/730780 (owner: 10Jbond) [12:43:45] (03PS2) 10Elukey: Add the kserve and kserve-inference charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/730761 (https://phabricator.wikimedia.org/T293331) [12:44:05] (03CR) 10Lucas Werkmeister (WMDE): "Currently, in production, the setting is:" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/730747 (https://phabricator.wikimedia.org/T291828) (owner: 10Lucas Werkmeister (WMDE)) [12:44:12] (03PS3) 10Urbanecm: growthexperiments: Run refreshLinkRecommendations in parallel [puppet] - 10https://gerrit.wikimedia.org/r/730752 (https://phabricator.wikimedia.org/T278103) [12:44:35] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/730724 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [12:44:51] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [12:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:05] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:45:19] (03PS10) 10ZPapierski: [WIP] Add kafka position transfer to wdqs cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/727021 (https://phabricator.wikimedia.org/T276469) [12:45:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2026.codfw.wmnet with OS buster [12:45:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:39] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Create Ganeti test cluster - https://phabricator.wikimedia.org/T286206 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2026.codfw.wmnet with OS buster completed: - ganeti2026 (**PASS**) - Downtimed... 
[12:46:23] (03CR) 10Urbanecm: growthexperiments: Run refreshLinkRecommendations in parallel (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730752 (https://phabricator.wikimedia.org/T278103) (owner: 10Urbanecm) [12:46:46] (Primary inbound port utilisation over 80% #page) firing: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [12:46:55] again... [12:47:15] * akosiaris around [12:47:32] that's different [12:47:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:47:42] it's eqord->eqiad [12:47:49] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [12:48:17] er, eqiad->eqord [12:48:41] (03CR) 10David Caro: "When swapping the role for the new profile on tools, I get the error:" [puppet] - 10https://gerrit.wikimedia.org/r/730700 (owner: 10Jbond) [12:49:34] * volans here if needed [12:49:47] here too [12:50:41] (03PS1) 10Arturo Borrero Gonzalez: openstack: galera: allow DB access to cinder-backup nodes [puppet] - 10https://gerrit.wikimedia.org/r/730782 (https://phabricator.wikimedia.org/T292546) [12:51:35] volans, godog, 302 _security [12:51:41] ack [12:53:48] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1003/31687/" [puppet] - 10https://gerrit.wikimedia.org/r/730782 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [12:54:15] (03CR) 10Kosta Harlan: [C: 03+1] growthexperiments: Run
refreshLinkRecommendations in parallel (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730752 (https://phabricator.wikimedia.org/T278103) (owner: 10Urbanecm) [12:55:37] (03PS1) 10David Caro: p::prometheus::node_exporter: fix typo in docs [puppet] - 10https://gerrit.wikimedia.org/r/730783 [12:57:41] (03CR) 10MMandere: [C: 03+2] bastionhost: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/730724 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [12:59:29] PROBLEM - Check systemd state on ms-be2038 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:59:48] (03CR) 10David Caro: P:prometheus::node_exporter: update node_exporter to a profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730700 (owner: 10Jbond) [13:02:19] (03PS1) 10Arturo Borrero Gonzalez: openstack: cinder.conf: specify lock path [puppet] - 10https://gerrit.wikimedia.org/r/730784 (https://phabricator.wikimedia.org/T292546) [13:02:36] (03CR) 10David Caro: P:prometheus::node_exporter: update node_exporter to a profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730700 (owner: 10Jbond) [13:03:19] (03PS1) 10David Caro: standard::prometheus: include the profile instead of the role [puppet] - 10https://gerrit.wikimedia.org/r/730785 [13:03:27] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: cinder.conf: specify lock path [puppet] - 10https://gerrit.wikimedia.org/r/730784 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [13:03:44] (03PS2) 10BBlack: interface-rps.py: no-op format/comment fixups [puppet] - 10https://gerrit.wikimedia.org/r/730210 (https://phabricator.wikimedia.org/T236208) [13:03:46] (03PS10) 10BBlack: interface: update rps script to also set the number of queues via ethtool [puppet] - 10https://gerrit.wikimedia.org/r/662688 (https://phabricator.wikimedia.org/T236208) (owner: 10Jbond) 
[13:03:48] (03PS1) 10BBlack: Mitigate outbound saturation from bytedance [puppet] - 10https://gerrit.wikimedia.org/r/730786 [13:04:39] (03CR) 10David Caro: [C: 03+2] p::prometheus::node_exporter: fix typo in docs [puppet] - 10https://gerrit.wikimedia.org/r/730783 (owner: 10David Caro) [13:04:55] (03CR) 10Filippo Giunchedi: [C: 03+1] Mitigate outbound saturation from bytedance [puppet] - 10https://gerrit.wikimedia.org/r/730786 (owner: 10BBlack) [13:05:04] (03PS2) 10BBlack: Mitigate outbound saturation from bytedance [puppet] - 10https://gerrit.wikimedia.org/r/730786 [13:06:18] (03CR) 10BBlack: [C: 03+2] Mitigate outbound saturation from bytedance [puppet] - 10https://gerrit.wikimedia.org/r/730786 (owner: 10BBlack) [13:09:53] (03PS1) 10Urbanecm: Enable VE by default on 4 more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730787 (https://phabricator.wikimedia.org/T290614) [13:09:55] (03PS1) 10Urbanecm: Deploy Growth wikis to 4 wikis in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730788 (https://phabricator.wikimedia.org/T291826) [13:10:08] (03PS1) 10BBlack: Bytedance: further reqrate reduction [puppet] - 10https://gerrit.wikimedia.org/r/730789 [13:11:46] (Primary inbound port utilisation over 80% #page) resolved: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [13:12:32] yay [13:12:49] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [13:13:59] (03PS3) 10Jbond: P:prometheus::node_exporter: update node_exporter to a profile [puppet] - 10https://gerrit.wikimedia.org/r/730750 [13:14:01] (03PS1) 10Jbond:
P:prometheus/node_exporter: use include during transision [puppet] - 10https://gerrit.wikimedia.org/r/730791 [13:14:30] !log uploaded orchestrator 3.2.6-1 packages to apt.wm.o (buster) T275784 [13:14:32] (03CR) 10Elukey: [C: 03+2] Add the kserve and kserve-inference charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/730761 (https://phabricator.wikimedia.org/T293331) (owner: 10Elukey) [13:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:36] T275784: orchestrator: Upgrade to v3.2.6 - https://phabricator.wikimedia.org/T275784 [13:15:54] (03PS2) 10BBlack: Bytedance: further reqrate reduction [puppet] - 10https://gerrit.wikimedia.org/r/730789 [13:19:07] (03PS11) 10ZPapierski: [WIP] Add kafka position transfer to wdqs cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/727021 (https://phabricator.wikimedia.org/T276469) [13:19:34] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31690/console" [puppet] - 10https://gerrit.wikimedia.org/r/730791 (owner: 10Jbond) [13:19:47] (03CR) 10David Caro: [C: 03+2] P:prometheus/node_exporter: use include during transision [puppet] - 10https://gerrit.wikimedia.org/r/730791 (owner: 10Jbond) [13:19:59] (03PS2) 10Elukey: Move the ml-serve cluster to KServe [deployment-charts] - 10https://gerrit.wikimedia.org/r/730762 (https://phabricator.wikimedia.org/T293331) [13:20:54] (03PS2) 10David Caro: p::prometheus::node_exporter: fix typo in docs [puppet] - 10https://gerrit.wikimedia.org/r/730783 [13:21:01] (03PS2) 10David Caro: standard::prometheus: include the profile instead of the role [puppet] - 10https://gerrit.wikimedia.org/r/730785 [13:21:03] (03PS1) 10MMandere: prometheus: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/730793 (https://phabricator.wikimedia.org/T282787) [13:21:56] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31691/console" [puppet] - 10https://gerrit.wikimedia.org/r/730785 (owner: 10David Caro) [13:22:40] (03CR) 10David Caro: [V: 03+1 C: 03+2] standard::prometheus: include the profile instead of the role [puppet] - 10https://gerrit.wikimedia.org/r/730785 (owner: 10David Caro) [13:23:12] (03PS1) 10DCausse: wdqs: enable the streaming updater on wdqs2005 [puppet] - 10https://gerrit.wikimedia.org/r/730794 (https://phabricator.wikimedia.org/T288231) [13:23:14] (03PS1) 10DCausse: wdqs: enable the streaming updater on wdqs2006 [puppet] - 10https://gerrit.wikimedia.org/r/730795 (https://phabricator.wikimedia.org/T288231) [13:23:16] (03PS1) 10DCausse: wdqs: enable the streaming updater on wdqs2001 [puppet] - 10https://gerrit.wikimedia.org/r/730796 (https://phabricator.wikimedia.org/T288231) [13:23:18] (03PS1) 10DCausse: wdqs: enable the streaming updater on wdqs2002 [puppet] - 10https://gerrit.wikimedia.org/r/730797 (https://phabricator.wikimedia.org/T288231) [13:23:20] (03PS1) 10DCausse: wdqs: enable the streaming updater on wdqs2003 [puppet] - 10https://gerrit.wikimedia.org/r/730798 (https://phabricator.wikimedia.org/T288231) [13:23:22] (03PS1) 10DCausse: wdqs: enable the streaming updater on wdqs2004 [puppet] - 10https://gerrit.wikimedia.org/r/730799 (https://phabricator.wikimedia.org/T288231) [13:23:24] (03PS1) 10DCausse: wdqs: enable the streaming updater on wdqs2007 [puppet] - 10https://gerrit.wikimedia.org/r/730800 (https://phabricator.wikimedia.org/T288231) [13:24:04] (03PS12) 10Gehel: Add kafka position transfer to wdqs cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/727021 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [13:25:11] (03PS1) 10Muehlenhoff: Fix arg name [cookbooks] - 10https://gerrit.wikimedia.org/r/730801 [13:28:03] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes 
https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [13:28:09] (03PS13) 10MSantos: maps: Add script to send tile invalidation events [puppet] - 10https://gerrit.wikimedia.org/r/722825 (https://phabricator.wikimedia.org/T270175) (owner: 10Jgiannelos) [13:30:19] (03PS2) 10Ayounsi: scripts: add support for new VLAN names [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/730767 (https://phabricator.wikimedia.org/T293294) (owner: 10Volans) [13:30:29] (03CR) 10MSantos: maps: Add script to send tile invalidation events (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/722825 (https://phabricator.wikimedia.org/T270175) (owner: 10Jgiannelos) [13:31:58] (03CR) 10Gehel: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/727021 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [13:33:31] ^ should be fixed [13:33:52] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:11] (03Abandoned) 10David Caro: standard::prometheus: include the profile instead of the role [puppet] - 10https://gerrit.wikimedia.org/r/730785 (owner: 10David Caro) [13:34:42] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/730750 (owner: 10Jbond) [13:35:18] (03CR) 10Volans: [C: 03+1] "LGTM, just couple of non-blocking optional comments inline." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/727021 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [13:36:02] (03CR) 10Ema: [C: 03+1] apt: add a service description for apt to allow DNS discovery [dns] - 10https://gerrit.wikimedia.org/r/730524 (owner: 10Jbond) [13:36:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet [13:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:42] (03CR) 10Ema: [C: 03+1] apt: add a service description for apt to allow DNS discovery [puppet] - 10https://gerrit.wikimedia.org/r/730523 (owner: 10Jbond) [13:37:22] 10SRE, 10Language-Team (Language-2021-October-December): Remove Matxin Key from Production - https://phabricator.wikimedia.org/T292635 (10Pginer-WMF) 05Open→03Resolved [13:38:38] (03CR) 10Ayounsi: [C: 03+1] "LGTM!" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/730767 (https://phabricator.wikimedia.org/T293294) (owner: 10Volans) [13:41:59] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, +Jaime for awareness" [puppet] - 10https://gerrit.wikimedia.org/r/730793 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [13:43:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet [13:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:23] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31692/console" [puppet] - 10https://gerrit.wikimedia.org/r/730750 (owner: 10Jbond) [13:45:26] (03CR) 10David Caro: [V: 03+1 C: 03+1] P:prometheus::node_exporter: update node_exporter to a profile [puppet] - 10https://gerrit.wikimedia.org/r/730750 (owner: 10Jbond) [13:45:32] (03CR) 10Volans: [C: 03+1] "One nit and suggestion inline, lgtm." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/730801 (owner: 10Muehlenhoff) [13:46:09] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [13:47:57] (03CR) 10Filippo Giunchedi: [C: 03+1] P:prometheus::node_exporter: update node_exporter to a profile [puppet] - 10https://gerrit.wikimedia.org/r/730750 (owner: 10Jbond) [13:48:12] (03PS1) 10Ayounsi: Add et- interface support to Netbox script/report [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/730804 (https://phabricator.wikimedia.org/T293294) [13:48:33] (03CR) 10Elukey: [C: 03+2] Move the ml-serve cluster to KServe [deployment-charts] - 10https://gerrit.wikimedia.org/r/730762 (https://phabricator.wikimedia.org/T293331) (owner: 10Elukey) [13:49:09] (03PS3) 10CDanis: add httpbb.main to console-scripts entry_points [software/httpbb] - 10https://gerrit.wikimedia.org/r/640256 [13:50:14] (03CR) 10Volans: [C: 03+2] scripts: add support for new VLAN names [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/730767 (https://phabricator.wikimedia.org/T293294) (owner: 10Volans) [13:50:52] (03CR) 10Muehlenhoff: Fix arg name (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/730801 (owner: 10Muehlenhoff) [13:50:54] (03PS2) 10Muehlenhoff: sre.ganeti.addnode: Fix arg name [cookbooks] - 10https://gerrit.wikimedia.org/r/730801 [13:50:56] (03Merged) 10jenkins-bot: scripts: add support for new VLAN names [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/730767 (https://phabricator.wikimedia.org/T293294) (owner: 10Volans) [13:51:11] (03CR) 10jerkins-bot: [V: 04-1] add httpbb.main to console-scripts entry_points [software/httpbb] - 10https://gerrit.wikimedia.org/r/640256 (owner: 10CDanis) [13:52:23] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[13:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:32] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [13:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:22] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/730804 (https://phabricator.wikimedia.org/T293294) (owner: 10Ayounsi) [13:54:21] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [13:54:22] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [13:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:42] (03PS2) 10Ayounsi: Add et- interface support to Netbox script/report [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/730804 (https://phabricator.wikimedia.org/T293294) [13:54:42] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [13:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:04] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[13:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:57] (03CR) 10Ayounsi: [C: 03+2] Add et- interface support to Netbox script/report [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/730804 (https://phabricator.wikimedia.org/T293294) (owner: 10Ayounsi) [13:56:18] (03CR) 10Urbanecm: [C: 03+2] Enable VE by default on 4 more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730787 (https://phabricator.wikimedia.org/T290614) (owner: 10Urbanecm) [13:56:25] (03CR) 10Urbanecm: [C: 03+2] Deploy Growth wikis to 4 wikis in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730788 (https://phabricator.wikimedia.org/T291826) (owner: 10Urbanecm) [13:56:37] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [13:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:05] (03Merged) 10jenkins-bot: Enable VE by default on 4 more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730787 (https://phabricator.wikimedia.org/T290614) (owner: 10Urbanecm) [13:57:09] (03Merged) 10jenkins-bot: Deploy Growth wikis to 4 wikis in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730788 (https://phabricator.wikimedia.org/T291826) (owner: 10Urbanecm) [13:57:28] (03PS1) 10Jbond: C:prometheus::node_puppet_agent: drop dependency on node_exporter [puppet] - 10https://gerrit.wikimedia.org/r/730806 [13:58:56] (03PS1) 10Elukey: helmfile.d: update the deploy-kserve cluster role to KServe [deployment-charts] - 10https://gerrit.wikimedia.org/r/730807 (https://phabricator.wikimedia.org/T293331) [14:03:10] !log urbanecm@deploy1002 Synchronized dblists/visualeditor-nondefault.dblist: 82d0a4bf45126ecba2cfcd1a0c2081a00f58dca3: Enable VE by default on 4 more wikis (T290614) (duration: 01m 05s) [14:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:16] T290614: Activation 
of the visual editor by default on a few wikis - https://phabricator.wikimedia.org/T290614 [14:03:27] (03CR) 10Elukey: [C: 03+2] helmfile.d: update the deploy-kserve cluster role to KServe [deployment-charts] - 10https://gerrit.wikimedia.org/r/730807 (https://phabricator.wikimedia.org/T293331) (owner: 10Elukey) [14:04:11] RECOVERY - Check systemd state on ms-be2038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:04:47] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: b35adfc59eec9c19b509bb9439cdfe33978a4f8b: Deploy Growth wikis to 4 wikis in dark mode (T291826; 1/2) (duration: 01m 04s) [14:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:53] T291826: Deploy Growth features on Gan, Inuktitut and Tajik Wikipedia - https://phabricator.wikimedia.org/T291826 [14:05:05] 10SRE, 10Platform Engineering: Degraded RAID on sessionstore1003 - https://phabricator.wikimedia.org/T291738 (10hnowlan) [14:05:47] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [14:05:49] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [14:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:11] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . 
[14:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:10] !log Create growthexperiments DB tables for ganwiki, iuwiki, tgwiki (T291826) [14:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:23] !log Run extensions/GrowthExperiments/initWikiConfig.php for ganwiki, iuwiki, tgwiki (T291826) [14:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:49] 10SRE, 10Platform Engineering: Degraded RAID on sessionstore1003 - https://phabricator.wikimedia.org/T291738 (10hnowlan) Thanks @Jclark-ctr and @Cmjohnson! I have remirrored the disk via: ` sfdisk -d /dev/sda | sfdisk /dev/sdb ` Layout looks fine: ` root@sessionstore1003:/home/hnowlan# fdisk -l /dev/sdb Dis... [14:08:47] 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering, 10Patch-For-Review, 10User-notice: Replace Liberation 1 fonts with Liberation 2 for svg rendering - https://phabricator.wikimedia.org/T253600 (10Johan) Do you know when this would be in production, or is it too early to tell? 
[14:09:55] (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [14:10:22] !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: b35adfc59eec9c19b509bb9439cdfe33978a4f8b: Deploy Growth wikis to 4 wikis in dark mode (T291826; 2/2) (duration: 01m 03s) [14:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:28] T291826: Deploy Growth features on Gan, Inuktitut and Tajik Wikipedia - https://phabricator.wikimedia.org/T291826 [14:10:29] * urbanecm done [14:11:10] I suspect the logstash indexing errors are due to the recent deploy, investigating [14:11:33] (03PS2) 10Jbond: C:prometheus::node_puppet_agent: drop dependency on node_exporter [puppet] - 10https://gerrit.wikimedia.org/r/730806 [14:11:36] (03PS4) 10Jbond: P:prometheus::node_exporter: update node_exporter to a profile [puppet] - 10https://gerrit.wikimedia.org/r/730750 [14:11:52] (03CR) 10Jgiannelos: [C: 03+1] maps: Add script to send tile invalidation events [puppet] - 10https://gerrit.wikimedia.org/r/722825 (https://phabricator.wikimedia.org/T270175) (owner: 10Jgiannelos) [14:11:54] godog: i hope you don't mean _my_ deploy, I don't see how i could cause elasticsearch indexing error. [14:12:29] happy to revert if needed :)) [14:12:57] urbanecm: no I don't mean your deploy :) it likely a knative thing we've seen before [14:13:02] elukey: ^ [14:13:28] urbanecm: thanks for checking though, appreciate it [14:13:28] good :)) [14:13:30] godog: ah yes sorry I am working on it, should be fixed soon :( [14:13:40] thanks urbanecm ! Sorry for the extra trouble [14:14:08] elukey: ah ok, yeah this is the invalid json in knative logs we had a task a couple of weeks ago [14:14:12] so "known" [14:14:20] no problem at all. 
As long as I didn't actually break anything :)) [14:16:21] godog: yes I am not sure how to fix it, it is surely a bug in knative but we can't upgrade atm.. [14:17:03] maybe I can prevent logs to be shipped to logstash [14:19:07] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [14:19:11] 10SRE, 10ops-drmrs, 10DNS, 10Infrastructure-Foundations, and 2 others: setup drmrs mgmt & private prefixs - question on switch status - https://phabricator.wikimedia.org/T293294 (10ayounsi) I created a bunch of mgmt related cables: https://netbox.wikimedia.org/dcim/cables/?q=&site=drmrs&type=&status=&color... [14:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:21] elukey: I don't know tbh what's up, it looks like "knative_dev/key" sometimes is a text field and sometimes is a nested object with "knative.dev/key" (notice . vs _) [14:19:35] and sometimes changes on deploy [14:23:27] !log installing krb5 security updates on KDCs [14:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:03] godog: it seems to change when something is off (I am upgrading pods and they don't come up), is there any way to say add a rule to logstash to accept both? 
(very ignorant question) [14:26:59] elukey: kinda, we could mangle the field to un-nest it [14:29:02] I've updated T288549 [14:29:02] T288549: Indexing errors from logs generated by Activator - https://phabricator.wikimedia.org/T288549 [14:29:31] (03PS1) 10Ayounsi: Fix cable type/color for 25G server links [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/730810 (https://phabricator.wikimedia.org/T293294) [14:29:34] (03PS1) 10Bartosz Dziewoński: Fix value of 'namespacesWithSubpages' in wgVisualEditorConfig [extensions/VisualEditor] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730729 (https://phabricator.wikimedia.org/T293310) [14:29:52] godog: that would be great, lemme know when you have a moment to discuss it, no idea where to start [14:31:41] (03PS1) 10BryanDavis: toolhub: Bump container version to 2021-10-14-111459-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730812 [14:31:55] if anyone is available and would like to fix a deployment blocker… https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/730729 [14:32:03] elukey: yeah following up on the task is best I think, sh.dubsh would have a better idea on what needs to happen [14:32:11] (otherwise i'll schedule it for the backport window) [14:32:49] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/730810 (https://phabricator.wikimedia.org/T293294) (owner: 10Ayounsi) [14:33:11] godog: <3 [14:33:33] (03CR) 10Jbond: [C: 03+2] apt: add a service description for apt to allow DNS discovery [puppet] - 10https://gerrit.wikimedia.org/r/730523 (owner: 10Jbond) [14:34:55] (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [14:35:05] (03PS1) 10DCausse: wdqs: enable the streaming updater on wdqs1003 [puppet] - 
10https://gerrit.wikimedia.org/r/730814 (https://phabricator.wikimedia.org/T288231) [14:35:07] (03PS1) 10DCausse: wdqs: enable the streaming updater on wdqs1008 [puppet] - 10https://gerrit.wikimedia.org/r/730815 (https://phabricator.wikimedia.org/T288231) [14:35:09] (03PS13) 10ZPapierski: sre.wdqs: Add kafka position transfer to wdqs cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/727021 (https://phabricator.wikimedia.org/T276469) [14:35:11] (03PS1) 10DCausse: wdqs: enable the streaming updater on wdqs1011 [puppet] - 10https://gerrit.wikimedia.org/r/730816 (https://phabricator.wikimedia.org/T288231) [14:35:13] (03PS1) 10DCausse: wdqs: enable the streaming updater on wdqs1004 [puppet] - 10https://gerrit.wikimedia.org/r/730817 (https://phabricator.wikimedia.org/T288231) [14:35:15] (03PS1) 10DCausse: wdqs: enable the streaming updater on wdqs1005 [puppet] - 10https://gerrit.wikimedia.org/r/730818 (https://phabricator.wikimedia.org/T288231) [14:35:17] (03PS1) 10DCausse: wdqs: enable the streaming updater on wdqs1006 [puppet] - 10https://gerrit.wikimedia.org/r/730819 (https://phabricator.wikimedia.org/T288231) [14:35:19] (03PS1) 10DCausse: wdqs: enable the streaming updater on wdqs1007 [puppet] - 10https://gerrit.wikimedia.org/r/730820 (https://phabricator.wikimedia.org/T288231) [14:35:21] (03PS1) 10DCausse: wdqs: enable the streaming updater on wdqs1012 [puppet] - 10https://gerrit.wikimedia.org/r/730821 (https://phabricator.wikimedia.org/T288231) [14:35:23] (03PS1) 10DCausse: wdqs: enable the streaming updater on wdqs1013 [puppet] - 10https://gerrit.wikimedia.org/r/730822 (https://phabricator.wikimedia.org/T288231) [14:36:25] MatmaRex: looking for a deploy? [14:37:13] dancy: yeah, if you have the time? i'm not a deployer [14:37:22] OK. 
hitting +2 [14:38:01] (03CR) 10Ahmon Dancy: [C: 03+2] Fix value of 'namespacesWithSubpages' in wgVisualEditorConfig [extensions/VisualEditor] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730729 (https://phabricator.wikimedia.org/T293310) (owner: 10Bartosz Dziewoński) [14:38:06] (03PS1) 10Volans: scripts: fix support for new VLAN names [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/730823 [14:38:44] (03CR) 10Ayounsi: [C: 03+2] Fix cable type/color for 25G server links [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/730810 (https://phabricator.wikimedia.org/T293294) (owner: 10Ayounsi) [14:39:55] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [14:40:51] (03PS4) 10Zabe: Dumps: Clarify licensing for Wikidata and update various links [puppet] - 10https://gerrit.wikimedia.org/r/730243 (https://phabricator.wikimedia.org/T279436) (owner: 10Dylsss) [14:40:55] (03CR) 10David Caro: "Can we add a requirement on the directory that might be missing? that will force the ordering of the classes right?" 
[puppet] - 10https://gerrit.wikimedia.org/r/730806 (owner: 10Jbond) [14:41:46] (03CR) 10Muehlenhoff: [C: 03+2] sre.ganeti.addnode: Fix arg name [cookbooks] - 10https://gerrit.wikimedia.org/r/730801 (owner: 10Muehlenhoff) [14:42:44] (03CR) 10Zabe: "(fixed the commit message in order to make CI pass, hope thats ok to you)" [puppet] - 10https://gerrit.wikimedia.org/r/730243 (https://phabricator.wikimedia.org/T279436) (owner: 10Dylsss) [14:43:11] PROBLEM - SSH on ms-be2035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:43:23] !log migrate apt.w.o to a dns active/passiev discovery address (cc moritzm) [14:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:33] (03CR) 10Jbond: [C: 03+2] apt: add a service description for apt to allow DNS discovery [dns] - 10https://gerrit.wikimedia.org/r/730524 (owner: 10Jbond) [14:43:41] doh.. hello old friend ms-be2035 (cc godog, Emperor ) [14:44:05] (03PS14) 10ZPapierski: sre.wdqs: Add kafka position transfer to wdqs cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/727021 (https://phabricator.wikimedia.org/T276469) [14:45:07] RECOVERY - SSH on ms-be2035 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:45:11] (03PS1) 10Jbond: Revert "apt: add a service description for apt to allow DNS discovery" [dns] - 10https://gerrit.wikimedia.org/r/730730 [14:45:23] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "apt: add a service description for apt to allow DNS discovery" [dns] - 10https://gerrit.wikimedia.org/r/730730 (owner: 10Jbond) [14:46:07] (03PS1) 10Jbond: apt: add a service description for apt to allow DNS discovery [dns] - 10https://gerrit.wikimedia.org/r/730731 [14:46:59] volans: siiigh, I'll take a look [14:47:48] godog: I didn't touch it, seems back up from icinga, didn't reboot at least [14:48:19] volans: yeah seems "fine", perhaps a blip [14:48:25] can't find 
anything obvious in dmesg [14:48:27] (03PS1) 10Muehlenhoff: sre.ganeti.makevm: Fix remote [cookbooks] - 10https://gerrit.wikimedia.org/r/730824 [14:48:38] maybe check HW logs [14:48:43] jbond: ack [14:48:45] to see if they reported something [14:48:59] PROBLEM - Check systemd state on ms-be2043 is CRITICAL: CRITICAL - degraded: The following units failed: session-214155.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:49:20] !log jbond@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=apt [14:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:32] (03CR) 10Volans: [C: 03+1] "LGTMLGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/730824 (owner: 10Muehlenhoff) [14:49:38] (03CR) 10Jbond: [C: 03+2] apt: add a service description for apt to allow DNS discovery [dns] - 10https://gerrit.wikimedia.org/r/730731 (owner: 10Jbond) [14:50:43] (03PS2) 10Muehlenhoff: sre.ganeti.addnode: Fix remote [cookbooks] - 10https://gerrit.wikimedia.org/r/730824 [14:51:00] (03PS1) 10Ayounsi: Fix typo [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/730828 [14:51:14] volans: nope sel doesn't have anything recent, I'll let it be [14:51:32] ack [14:51:53] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/730828 (owner: 10Ayounsi) [14:52:30] (03CR) 10BryanDavis: [C: 03+2] toolhub: Bump container version to 2021-10-14-111459-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730812 (owner: 10BryanDavis) [14:53:52] !log upgrading orchestrator.wm.o to 3.2.6-1 T275784 [14:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:57] (03CR) 10Gehel: [C: 03+1] "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/727021 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [14:53:58] T275784: orchestrator: Upgrade to v3.2.6 - https://phabricator.wikimedia.org/T275784 [14:54:09] (03CR) 10Gehel: [C: 
03+2] sre.wdqs: Add kafka position transfer to wdqs cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/727021 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [14:54:48] (03CR) 10Ayounsi: "recheck" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/730828 (owner: 10Ayounsi) [14:55:05] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] sre.ganeti.addnode: Fix remote [cookbooks] - 10https://gerrit.wikimedia.org/r/730824 (owner: 10Muehlenhoff) [14:55:11] (03PS3) 10Muehlenhoff: sre.ganeti.addnode: Fix remote [cookbooks] - 10https://gerrit.wikimedia.org/r/730824 [14:55:13] (03PS1) 10Arturo Borrero Gonzalez: openstack: cinder backups: introduce ceph client config [puppet] - 10https://gerrit.wikimedia.org/r/730829 (https://phabricator.wikimedia.org/T292546) [14:55:29] (03PS2) 10Arturo Borrero Gonzalez: openstack: cinder backups: introduce ceph client config [puppet] - 10https://gerrit.wikimedia.org/r/730829 (https://phabricator.wikimedia.org/T292546) [14:56:35] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10Papaul) Received the PDU today. Mounting the PDU in the rack was very easy, took the PDU out of the box and straight in the rack no adjustment. It is also very l... 
[14:57:35] (03PS3) 10Jbond: C:prometheus::node_puppet_agent: drop dependency on node_exporter [puppet] - 10https://gerrit.wikimedia.org/r/730806 [14:57:42] (03Merged) 10jenkins-bot: toolhub: Bump container version to 2021-10-14-111459-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730812 (owner: 10BryanDavis) [14:58:04] (03PS3) 10Arturo Borrero Gonzalez: openstack: cinder backups: introduce ceph client config [puppet] - 10https://gerrit.wikimedia.org/r/730829 (https://phabricator.wikimedia.org/T292546) [14:58:41] (03PS4) 10Jbond: C:prometheus::node_puppet_agent: drop dependency on node_exporter [puppet] - 10https://gerrit.wikimedia.org/r/730806 [14:58:53] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] sre.ganeti.addnode: Fix remote [cookbooks] - 10https://gerrit.wikimedia.org/r/730824 (owner: 10Muehlenhoff) [14:59:05] (03CR) 10Jbond: C:prometheus::node_puppet_agent: drop dependency on node_exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730806 (owner: 10Jbond) [14:59:51] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31695/console" [puppet] - 10https://gerrit.wikimedia.org/r/730806 (owner: 10Jbond) [14:59:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2026.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [14:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:07] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2026.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [15:00:10] (03Merged) 10jenkins-bot: Fix value of 'namespacesWithSubpages' in wgVisualEditorConfig [extensions/VisualEditor] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730729 (https://phabricator.wikimedia.org/T293310) (owner: 10Bartosz Dziewoński) [15:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:14] (03CR) 10Ayounsi: [C: 03+2] Fix typo 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/730828 (owner: 10Ayounsi) [15:02:44] !log dancy@deploy1002 Synchronized php-1.38.0-wmf.4/extensions/Collection/includes/CollectionHooks.php: Backport: [[gerrit:730580|Check that the timestamp key/value is set to avoid undefined offset (T293300)]] (duration: 01m 03s) [15:02:49] MatmaRex: Deployed [15:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:51] T293300: PHP Notice: Undefined index: timestamp - https://phabricator.wikimedia.org/T293300 [15:03:33] (03PS4) 10Arturo Borrero Gonzalez: openstack: cinder backups: introduce ceph client config [puppet] - 10https://gerrit.wikimedia.org/r/730829 (https://phabricator.wikimedia.org/T292546) [15:03:52] thanks dancy! [15:04:21] log message was wrong above.. ignore that [15:05:07] (i'm testing it now) [15:05:20] hold on a sec.. I didn't sync the right file. [15:05:31] yeah, i was gonna ask. i didn't see the effect [15:06:07] is anything going on with the job queue at the moment? [15:06:08] !log dancy@deploy1002 Synchronized php-1.38.0-wmf.4/extensions/VisualEditor/includes/VisualEditorHooks.php: Backport: [[gerrit:730729|Fix value of 'namespacesWithSubpages' in wgVisualEditorConfig (T293310)]] (duration: 01m 04s) [15:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:14] T293310: Unable to change the target of internal links - https://phabricator.wikimedia.org/T293310 [15:06:27] we’re seeing some jobs pile up, since ca 20 minutes ago if I’m not mistaken [15:06:29] https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-job=wikibase-InjectRCRecords&from=now-1h&to=now [15:07:38] MatmaRex: Ok.. really deployed now. 
[15:07:43] (03PS1) 10Jbond: services: use correct codfw/eqiad addresses for apt [puppet] - 10https://gerrit.wikimedia.org/r/730830 [15:07:55] though other jobs don’t seem to be affected as much afaict [15:07:56] moritzm: can i get a second set of eyes just in case ^^ [15:08:21] looking [15:08:27] thanks [15:09:12] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, both IPs check out." [puppet] - 10https://gerrit.wikimedia.org/r/730830 (owner: 10Jbond) [15:11:06] (03CR) 10Jbond: [C: 03+2] services: use correct codfw/eqiad addresses for apt [puppet] - 10https://gerrit.wikimedia.org/r/730830 (owner: 10Jbond) [15:11:15] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10Papaul) {F34688676} [15:11:30] thanks dancy, something is annoyingly still cached when viewing pages normally, but it's fixed when using ?debug=1 [15:12:40] 10SRE, 10ops-drmrs, 10DNS, 10Infrastructure-Foundations, and 2 others: setup drmrs mgmt & private prefixs - question on switch status - https://phabricator.wikimedia.org/T293294 (10ayounsi) 05Open→03Resolved v6 assigned as well, server's network provisioning tested with both public/private vlan as dry-... [15:12:56] (03PS1) 10Elukey: Remove any trace of Kubeflow Kfserving [deployment-charts] - 10https://gerrit.wikimedia.org/r/730831 (https://phabricator.wikimedia.org/T293331) [15:13:31] !log bd808@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'toolhub' for release 'main' . 
[15:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:56] (03PS1) 10Elukey: Remove Kubeflow Kfserving [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/730832 (https://phabricator.wikimedia.org/T293331) [15:20:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2026.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [15:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2026.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [15:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:32] (03PS9) 10Jbond: sre.misc-clusters.thumbor: create batch action cook book for thumbor [cookbooks] - 10https://gerrit.wikimedia.org/r/657802 [15:23:05] (03CR) 10Jbond: sre.misc-clusters.thumbor: create batch action cook book for thumbor (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/657802 (owner: 10Jbond) [15:23:08] (03PS10) 10Jbond: sre.misc-clusters.thumbor: create batch action cook book for thumbor [cookbooks] - 10https://gerrit.wikimedia.org/r/657802 [15:23:57] PROBLEM - Check systemd state on ms-be2041 is CRITICAL: CRITICAL - degraded: The following units failed: session-214189.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:25:33] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:prometheus::node_puppet_agent: drop dependency on node_exporter [puppet] - 10https://gerrit.wikimedia.org/r/730806 (owner: 10Jbond) [15:27:34] (03PS2) 10Volans: scripts: fix support for new VLAN names [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/730823 [15:28:44] reported the job queue issue at https://phabricator.wikimedia.org/T293385 [15:29:00] (03PS5) 10Jbond: P:prometheus::node_exporter: update node_exporter to a profile [puppet] - 10https://gerrit.wikimedia.org/r/730750 [15:30:07] RECOVERY - Check systemd state on 
ms-be2041 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:21] (03CR) 10Dylsss: Dumps: Clarify licensing for Wikidata and update various links (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730243 (https://phabricator.wikimedia.org/T279436) (owner: 10Dylsss) [15:30:37] (03CR) 10Jbond: [C: 03+2] P:prometheus::node_exporter: update node_exporter to a profile [puppet] - 10https://gerrit.wikimedia.org/r/730750 (owner: 10Jbond) [15:30:59] (03PS2) 10Ssingh: anycast_monitoring: add checks for Wikidough DoH/DoT [puppet] - 10https://gerrit.wikimedia.org/r/730619 [15:31:54] (03CR) 10Ssingh: anycast_monitoring: add checks for Wikidough DoH/DoT (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730619 (owner: 10Ssingh) [15:32:17] (03PS44) 10Jbond: P:base: move production specific code to their own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [15:32:51] (03CR) 10jerkins-bot: [V: 04-1] P:base: move production specific code to their own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [15:33:14] (03CR) 10Ayounsi: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/730619 (owner: 10Ssingh) [15:33:33] RECOVERY - Check systemd state on ms-be2043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:33:35] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31696/console" [puppet] - 10https://gerrit.wikimedia.org/r/730619 (owner: 10Ssingh) [15:34:13] (03PS1) 10Muehlenhoff: Retire role::mediawiki::common [puppet] - 10https://gerrit.wikimedia.org/r/730836 [15:34:30] (03CR) 10Ayounsi: [C: 03+1] "Maybe add some comments to explain the logic, but LGTM!" 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/730823 (owner: 10Volans) [15:34:38] (03PS45) 10Jbond: P:base: move production specific code to their own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [15:35:23] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/730836 (owner: 10Muehlenhoff) [15:37:59] (03PS1) 10Jbond: adduser: move login.defs config to adduser [puppet] - 10https://gerrit.wikimedia.org/r/730837 [15:39:16] (03CR) 10Jbond: "i thought i had already created this CR but if i did i can't find it, so sorry if you already reviewed it :S" [puppet] - 10https://gerrit.wikimedia.org/r/730837 (owner: 10Jbond) [15:43:14] (03PS3) 10Volans: scripts: fix support for new VLAN names [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/730823 (https://phabricator.wikimedia.org/T283594) [15:49:03] (03CR) 10Ayounsi: [C: 03+1] scripts: fix support for new VLAN names [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/730823 (https://phabricator.wikimedia.org/T283594) (owner: 10Volans) [15:49:38] (03CR) 10Volans: [C: 03+2] scripts: fix support for new VLAN names [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/730823 (https://phabricator.wikimedia.org/T283594) (owner: 10Volans) [15:50:01] (03CR) 10Ssingh: [V: 03+1 C: 03+2] anycast_monitoring: add checks for Wikidough DoH/DoT [puppet] - 10https://gerrit.wikimedia.org/r/730619 (owner: 10Ssingh) [15:50:28] (03Merged) 10jenkins-bot: scripts: fix support for new VLAN names [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/730823 (https://phabricator.wikimedia.org/T283594) (owner: 10Volans) [15:52:51] !log T288231 `ryankemper@wdqs2005:~$ sudo depool` [15:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:57] T288231: Deploy the wdqs streaming updater to production - https://phabricator.wikimedia.org/T288231 [15:54:18] !log T288231 `ryankemper@wdqs2008:~$ 
sudo depool` [15:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:21] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [15:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:59] jouncebot: now [15:56:59] No deployments scheduled for the next 0 hour(s) and 3 minute(s) [15:57:11] jouncebot: next [15:57:12] In 0 hour(s) and 2 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211014T1600) [16:00:05] jbond and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211014T1600). [16:00:05] Dylsss: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:39] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: enable the streaming updater on wdqs2005 [puppet] - 10https://gerrit.wikimedia.org/r/730794 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [16:00:50] Dylsss: looking [16:01:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_streaming_updater site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:03:26] apergos are you able to look at and advice on https://gerrit.wikimedia.org/r/c/operations/puppet/+/730243 [16:04:53] !log T288231 `ryankemper@wdqs2005:~$ sudo run-puppet-agent --force` [16:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:00] T288231: Deploy the wdqs streaming updater to production - https://phabricator.wikimedia.org/T288231 [16:06:49] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: auto-detect systemd watchdog [software/acme-chief] - 10https://gerrit.wikimedia.org/r/730749 
(https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [16:07:06] jbond: legal usually takes care of that sort of thing although I may merge tings once I know they have legal team thumbs up [16:07:07] !log T288231 About to ctrl+c out of ongoing data transfer because puppet run following merge of https://gerrit.wikimedia.org/r/c/operations/puppet/+/730794 restarted blazegraph; we'll manually disable updater and kick off the transfer again [16:07:09] *things [16:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:43] reading the description I really don't know if those changes should be made or not, I've no useful thoughts on the matter [16:07:44] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [16:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:04] apergos: ack, me neither any idea who might? [16:08:52] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [16:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:46] well, whoever in legal deals with copyright and stuff on the wikis these days, I guess. but I don't know who that is [16:12:34] I know they don't really use phabricator but at least it could be mentioned on the task that they should give a thumbs up to the changes and then maybe loop the team in via email? 
I'm sorry I'm not coming up with anything better here [16:12:57] (03PS1) 10Herron: centrallog2002: apply role::syslog::centralserver [puppet] - 10https://gerrit.wikimedia.org/r/730843 (https://phabricator.wikimedia.org/T292196) [16:13:28] apergos: ack thanks [16:13:30] (03CR) 10jerkins-bot: [V: 04-1] centrallog2002: apply role::syslog::centralserver [puppet] - 10https://gerrit.wikimedia.org/r/730843 (https://phabricator.wikimedia.org/T292196) (owner: 10Herron) [16:13:44] feel free to add me as a subscriber on the task if you want me to follow along [16:14:08] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@071f7c3]: Increase mirrored traffic to 100% for eqiad [16:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:17] ack thanks will do [16:14:37] (03PS2) 10Herron: centrallog2002: apply role::syslog::centralserver [puppet] - 10https://gerrit.wikimedia.org/r/730843 (https://phabricator.wikimedia.org/T292196) [16:16:24] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/730843 (https://phabricator.wikimedia.org/T292196) (owner: 10Herron) [16:16:49] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@071f7c3]: Increase mirrored traffic to 100% for eqiad (duration: 02m 41s) [16:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:40] (03CR) 10Ahmon Dancy: [C: 03+2] Check that the timestamp key/value is set to avoid undefined offset [extensions/Collection] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730580 (https://phabricator.wikimedia.org/T293300) (owner: 10Daniel Kinzler) [16:18:56] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (10RobH) [16:19:04] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 
(10RobH) [16:19:32] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (10RobH) a:03Jclark-ctr [16:22:50] (03Merged) 10jenkins-bot: Check that the timestamp key/value is set to avoid undefined offset [extensions/Collection] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730580 (https://phabricator.wikimedia.org/T293300) (owner: 10Daniel Kinzler) [16:24:58] !log dancy@deploy1002 Synchronized php-1.38.0-wmf.4/extensions/Collection/includes/CollectionHooks.php: Backport: [[gerrit:730580|Check that the timestamp key/value is set to avoid undefined offset (T293300)]] (duration: 01m 04s) [16:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:07] T293300: PHP Notice: Undefined index: timestamp - https://phabricator.wikimedia.org/T293300 [16:25:49] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@4bff2d1]: Force mirrored traffic to 0% for everywhere [16:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:42] (03PS1) 10Jgiannelos: Configure event stream for map tiles state change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730848 (https://phabricator.wikimedia.org/T289771) [16:27:47] (03CR) 10Michael Große: "Please have a look whether my reasoning about the number in the concurrency makes sense to you." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/730846 (https://phabricator.wikimedia.org/T293385) (owner: 10Michael Große) [16:28:13] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@4bff2d1]: Force mirrored traffic to 0% for everywhere (duration: 02m 24s) [16:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:55] (03CR) 10Jgiannelos: [C: 04-1] "Block merging until deployment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730848 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [16:29:03] (03PS1) 10Jbond: standard::ntp: move standard ntp to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/730852 [16:30:43] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:31:33] (03CR) 10Dzahn: gitlab::ssh explicitly add git user with fixed id (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/728380 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [16:32:14] (03CR) 10Dzahn: gitlab::ssh explicitly add git user with fixed id (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/728380 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [16:32:22] (03CR) 10Dzahn: [C: 03+1] gitlab::ssh explicitly add git user with fixed id [puppet] - 10https://gerrit.wikimedia.org/r/728380 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [16:32:27] It would be great if I could get a review for https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/730846 to work around the job queue issue that is impacting important parts of Wikidata change dispatching for almost two hours now. 
(related: https://phabricator.wikimedia.org/T293385) [16:32:33] (03CR) 10Elukey: [V: 03+2 C: 03+2] Remove Kubeflow Kfserving [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/730832 (https://phabricator.wikimedia.org/T293331) (owner: 10Elukey) [16:33:14] !log installing node-ansi-regex security updates [16:33:17] (03PS2) 10Jbond: standard::ntp: move standard ntp to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/730852 [16:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:19] (03PS1) 10Jbond: P:mail::default_mail_relay: move templates to correct location [puppet] - 10https://gerrit.wikimedia.org/r/730853 [16:33:37] (03PS1) 10Nray: Change VectorPrefDiffInstrumentation stream name to `mediawiki.skin_diff` [extensions/WikimediaEvents] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730732 (https://phabricator.wikimedia.org/T289622) [16:33:57] (03CR) 10Herron: "xionox are there any side effects to look out for when deploying bird to a new host for the first time? 
AIUI down the line we'll need to " [puppet] - 10https://gerrit.wikimedia.org/r/730843 (https://phabricator.wikimedia.org/T292196) (owner: 10Herron) [16:34:08] (03CR) 10Muehlenhoff: P:mail::default_mail_relay: move templates to correct location (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730853 (owner: 10Jbond) [16:34:48] (03PS2) 10Jbond: P:mail::default_mail_relay: move templates to correct location [puppet] - 10https://gerrit.wikimedia.org/r/730853 [16:34:52] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [16:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:07] (03CR) 10Jbond: "fixed" [puppet] - 10https://gerrit.wikimedia.org/r/730853 (owner: 10Jbond) [16:35:20] (03CR) 10jerkins-bot: [V: 04-1] P:mail::default_mail_relay: move templates to correct location [puppet] - 10https://gerrit.wikimedia.org/r/730853 (owner: 10Jbond) [16:35:51] (03PS1) 10Nray: Change VectorPrefDiffInstrumentation stream name to `mediawiki.skin_diff` [extensions/WikimediaEvents] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730733 (https://phabricator.wikimedia.org/T289622) [16:35:56] (03CR) 10jerkins-bot: [V: 04-1] standard::ntp: move standard ntp to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/730852 (owner: 10Jbond) [16:36:06] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31699/console" [puppet] - 10https://gerrit.wikimedia.org/r/730853 (owner: 10Jbond) [16:36:16] (03CR) 10Jakob: [C: 03+1] Raise the priority ofwikibase-InjectRCRecords job [deployment-charts] - 10https://gerrit.wikimedia.org/r/730846 (https://phabricator.wikimedia.org/T293385) (owner: 10Michael Große) [16:36:17] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.11 point update - https://phabricator.wikimedia.org/T292838 (10MoritzMuehlenhoff) [16:36:19] (03CR) 10Jbond: "PCC - wmcs: 
https://puppet-compiler.wmflabs.org/compiler1003/31697/" [puppet] - 10https://gerrit.wikimedia.org/r/730853 (owner: 10Jbond) [16:36:34] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [16:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:11] !log drop kubeflow-kfserving* docker images from deneb [16:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:57] (03PS3) 10Jbond: standard::ntp: move standard ntp to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/730852 [16:39:33] (03CR) 10Elukey: [C: 03+2] Remove any trace of Kubeflow Kfserving [deployment-charts] - 10https://gerrit.wikimedia.org/r/730831 (https://phabricator.wikimedia.org/T293331) (owner: 10Elukey) [16:40:47] (03PS3) 10Jbond: P:mail::default_mail_relay: move templates to correct location [puppet] - 10https://gerrit.wikimedia.org/r/730853 [16:40:55] (03PS4) 10Jbond: standard::ntp: move standard ntp to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/730852 [16:41:18] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [16:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:59] (03PS5) 10Jbond: standard::ntp: move standard ntp to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/730852 [16:43:28] (03PS5) 10Arturo Borrero Gonzalez: openstack: cinder backups: introduce ceph client config [puppet] - 10https://gerrit.wikimedia.org/r/730829 (https://phabricator.wikimedia.org/T292546) [16:44:13] !log T288231 Manually killed dangling `pigz` / `nc` processes on `wdqs2008` (and `wdqs2005` implicitly). 
Should be in the right state to re-start the `data-transfer` cookbook from again [16:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:21] T288231: Deploy the wdqs streaming updater to production - https://phabricator.wikimedia.org/T288231 [16:44:32] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [16:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:21] (03CR) 10Addshore: [C: 03+1] "+1, as long as this is actually raising it (I'm not sure what the default is)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/730846 (https://phabricator.wikimedia.org/T293385) (owner: 10Michael Große) [16:45:41] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, if we add new parameter and given that we'll switch to Postfix in the near future we could just as well call this profile::mai" [puppet] - 10https://gerrit.wikimedia.org/r/730853 (owner: 10Jbond) [16:45:50] (03PS1) 10Jbond: standard: remove standard module [puppet] - 10https://gerrit.wikimedia.org/r/730856 [16:47:05] (03PS2) 10Addshore: Raise the priority of wikibase-InjectRCRecords job [deployment-charts] - 10https://gerrit.wikimedia.org/r/730846 (https://phabricator.wikimedia.org/T293385) (owner: 10Michael Große) [16:49:39] 10SRE, 10Platform Engineering: Degraded RAID on sessionstore1003 - https://phabricator.wikimedia.org/T291738 (10hnowlan) 05Open→03Resolved [16:51:47] (03PS6) 10Jbond: standard::ntp: move standard ntp to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/730852 [16:51:49] (03PS1) 10Jbond: standrd::ntp: fix ntp order [puppet] - 10https://gerrit.wikimedia.org/r/730857 [16:52:02] (03PS2) 10Jbond: standard: remove standard module [puppet] - 10https://gerrit.wikimedia.org/r/730856 [16:53:02] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31700/console" [puppet] - 10https://gerrit.wikimedia.org/r/730857 (owner: 10Jbond) [16:53:59] 
RECOVERY - MD RAID on sessionstore1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [16:54:38] (03CR) 10jerkins-bot: [V: 04-1] standrd::ntp: fix ntp order [puppet] - 10https://gerrit.wikimedia.org/r/730857 (owner: 10Jbond) [16:55:19] (03Abandoned) 10Nray: Change VectorPrefDiffInstrumentation stream name to `mediawiki.skin_diff` [extensions/WikimediaEvents] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730733 (https://phabricator.wikimedia.org/T289622) (owner: 10Nray) [16:55:26] (03CR) 10Nray: "On second thought, I'll let it ride the train" [extensions/WikimediaEvents] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730732 (https://phabricator.wikimedia.org/T289622) (owner: 10Nray) [16:55:37] (03Abandoned) 10Nray: Change VectorPrefDiffInstrumentation stream name to `mediawiki.skin_diff` [extensions/WikimediaEvents] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730732 (https://phabricator.wikimedia.org/T289622) (owner: 10Nray) [16:57:19] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 103, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:58:13] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31701/console" [puppet] - 10https://gerrit.wikimedia.org/r/730857 (owner: 10Jbond) [17:00:05] chrisalbon and accraze: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211014T1700). 
[17:04:00] (03CR) 10Jbond: [C: 04-1] Standardize the stats system user uid (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) (owner: 10Ottomata) [17:04:31] (03CR) 10Jbond: [C: 03+2] "LGTM will merge" [puppet] - 10https://gerrit.wikimedia.org/r/725286 (https://phabricator.wikimedia.org/T290609) (owner: 10Urbanecm) [17:06:18] (03CR) 10Addshore: Raise the priority of wikibase-InjectRCRecords job (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/730846 (https://phabricator.wikimedia.org/T293385) (owner: 10Michael Große) [17:09:25] (03CR) 10Dzahn: [C: 03+1] "looks good to me, compiler output (from experimental and my own spot check) looks fine. Just because I seem to remember we have had discus" [puppet] - 10https://gerrit.wikimedia.org/r/730836 (owner: 10Muehlenhoff) [17:10:29] (03CR) 10Michael Große: Raise the priority of wikibase-InjectRCRecords job (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/730846 (https://phabricator.wikimedia.org/T293385) (owner: 10Michael Große) [17:10:36] (03PS1) 10Jbond: puppetmaster: fix spec test [puppet] - 10https://gerrit.wikimedia.org/r/730861 [17:11:15] (03PS2) 10Jbond: puppetmaster: fix spec test [puppet] - 10https://gerrit.wikimedia.org/r/730861 [17:12:22] (03CR) 10Jbond: [C: 03+2] puppetmaster: fix spec test [puppet] - 10https://gerrit.wikimedia.org/r/730861 (owner: 10Jbond) [17:13:10] Waiting 18:00 UTC... 
[17:13:31] (03PS2) 10Jbond: adduser: move login.defs config to adduser [puppet] - 10https://gerrit.wikimedia.org/r/730837 [17:13:33] (03PS4) 10Jbond: P:mail::default_mail_relay: move templates to correct location [puppet] - 10https://gerrit.wikimedia.org/r/730853 [17:13:35] (03PS2) 10Jbond: standrd::ntp: fix ntp order [puppet] - 10https://gerrit.wikimedia.org/r/730857 [17:13:37] (03PS7) 10Jbond: standard::ntp: move standard ntp to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/730852 [17:13:39] (03PS3) 10Jbond: standard: remove standard module [puppet] - 10https://gerrit.wikimedia.org/r/730856 [17:17:20] (03CR) 10Addshore: Raise the priority of wikibase-InjectRCRecords job (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/730846 (https://phabricator.wikimedia.org/T293385) (owner: 10Michael Große) [17:17:25] (03CR) 10Addshore: [C: 03+1] Raise the priority of wikibase-InjectRCRecords job [deployment-charts] - 10https://gerrit.wikimedia.org/r/730846 (https://phabricator.wikimedia.org/T293385) (owner: 10Michael Große) [17:18:32] (03CR) 10Hnowlan: [C: 03+2] Raise the priority of wikibase-InjectRCRecords job [deployment-charts] - 10https://gerrit.wikimedia.org/r/730846 (https://phabricator.wikimedia.org/T293385) (owner: 10Michael Große) [17:19:13] (03CR) 10Michael Große: Raise the priority of wikibase-InjectRCRecords job (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/730846 (https://phabricator.wikimedia.org/T293385) (owner: 10Michael Große) [17:20:58] (03PS1) 10Dzahn: osm: convert common role to profile, avoid role inside role [puppet] - 10https://gerrit.wikimedia.org/r/730862 [17:23:13] (03Merged) 10jenkins-bot: Raise the priority of wikibase-InjectRCRecords job [deployment-charts] - 10https://gerrit.wikimedia.org/r/730846 (https://phabricator.wikimedia.org/T293385) (owner: 10Michael Große) [17:24:43] (03CR) 10Dzahn: "I'm not sure where exactly this is used though." 
[puppet] - 10https://gerrit.wikimedia.org/r/730862 (owner: 10Dzahn) [17:29:11] !log addshore@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [17:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:33] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:30:56] (03PS1) 10Dzahn: builder/systemtap: convert systemtap::devserver to a profile [puppet] - 10https://gerrit.wikimedia.org/r/730863 [17:31:11] !log addshore@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [17:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:41] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:32:27] (03CR) 10jerkins-bot: [V: 04-1] builder/systemtap: convert systemtap::devserver to a profile [puppet] - 10https://gerrit.wikimedia.org/r/730863 (owner: 10Dzahn) [17:32:58] !log addshore@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . 
[17:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:34] (03CR) 10Dzahn: [C: 04-1] "well, style guide does not agree:) no (motd) role in non-role classes and also profile includes non-profile class" [puppet] - 10https://gerrit.wikimedia.org/r/730863 (owner: 10Dzahn) [17:38:09] (03PS2) 10Dzahn: builder/systemtap: merge role::systemtap::devserver into builder [puppet] - 10https://gerrit.wikimedia.org/r/730863 [17:39:36] (03PS4) 10Urbanecm: growthexperiments: Run refreshLinkRecommendations in parallel [puppet] - 10https://gerrit.wikimedia.org/r/730752 (https://phabricator.wikimedia.org/T278103) [17:40:31] (03CR) 10Volans: "LGTM, just few nits inline and can take advantage of the newest spicerack features ;)" [cookbooks] - 10https://gerrit.wikimedia.org/r/730506 (owner: 10Jbond) [17:42:21] !log depool mw1452 for training [17:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:48] rzl: training? Never heard of training the appservers [17:43:05] They've gotta know how to run MediaWiki [17:45:47] !log rzl@cumin1001 conftool action : set/pooled=no; selector: name=mw1452.eqiad.wmnet [17:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:27] !log bd808@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'toolhub' for release 'main' . [17:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:56] (03CR) 10Volans: [C: 04-1] "Left some comment inline, I'm not convinced by some changes." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/730513 (owner: 10Jbond) [17:50:14] (03CR) 10Andrew Bogott: [C: 03+1] dynamicproxy: remove absented cron code [puppet] - 10https://gerrit.wikimedia.org/r/726730 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [17:52:20] !log repooled mw1452 (with `sudo pool` so no auto log from conftool) [17:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:57] (03Abandoned) 10Andrew Bogott: Openstack: support multiple regions [software/cumin] - 10https://gerrit.wikimedia.org/r/477811 (https://phabricator.wikimedia.org/T208861) (owner: 10Andrew Bogott) [17:53:47] (03PS4) 10Nray: Add new 'mediawiki.skin_diff' event logging stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725161 (https://phabricator.wikimedia.org/T289622) [17:54:24] (03CR) 10Andrew Bogott: [C: 03+2] toolforge: drop legacy webservice endpoints on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/715701 (owner: 10Majavah) [17:54:41] !log bd808@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'toolhub' for release 'main' . [17:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:13] (03CR) 10Addshore: [C: 03+2] "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/730866 (https://phabricator.wikimedia.org/T293385) (owner: 10Michael Große) [17:55:19] (03CR) 10Michael Große: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/730866 (https://phabricator.wikimedia.org/T293385) (owner: 10Michael Große) [17:59:01] (03CR) 10Bstorm: [C: 03+2] toolforge: wheel of misfortune: dry run on buster [puppet] - 10https://gerrit.wikimedia.org/r/729593 (https://phabricator.wikimedia.org/T282949) (owner: 10Majavah) [18:00:05] RoanKattouw and Urbanecm: Dear deployers, time to do the UTC evening backport window deploy. Dont look at me like that. You signed up for it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211014T1800). [18:00:05] Juan_90264, Seddon, and nray: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:15] I can deploy today! [18:00:26] (03PS1) 10Dzahn: peek: replace crons with timers [puppet] - 10https://gerrit.wikimedia.org/r/730867 (https://phabricator.wikimedia.org/T273673) [18:00:27] thank you urbanecm o/ [18:00:37] Juan_90264: Seddon: hi, are you around? [18:00:47] (03Merged) 10jenkins-bot: Increase concurrency for wikibase-InjectRCRecords job [deployment-charts] - 10https://gerrit.wikimedia.org/r/730866 (https://phabricator.wikimedia.org/T293385) (owner: 10Michael Große) [18:00:57] (03CR) 10jerkins-bot: [V: 04-1] peek: replace crons with timers [puppet] - 10https://gerrit.wikimedia.org/r/730867 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [18:01:17] @urbanecm Around! [18:01:20] thanks! [18:01:30] !log addshore@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [18:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:47] Seddon: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MediaSearch/+/727216 is supposed to be backported to wmf.4 only, right? [18:02:03] (03CR) 10Urbanecm: [C: 03+2] Add new 'mediawiki.skin_diff' event logging stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725161 (https://phabricator.wikimedia.org/T289622) (owner: 10Nray) [18:02:08] (03PS1) 10Urbanecm: Fix assessment quickview labels [extensions/MediaSearch] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730734 (https://phabricator.wikimedia.org/T292596) [18:02:32] !log addshore@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . 
[18:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:03] (03Merged) 10jenkins-bot: Add new 'mediawiki.skin_diff' event logging stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725161 (https://phabricator.wikimedia.org/T289622) (owner: 10Nray) [18:03:25] @urbanecm yeah it's only commons so I don't think there is any need to backport to wmf3 [18:03:33] nray: pulled to mwdebug1001 in case you want to test (patch discussion indicates it's a no-op, but not 100% sure though) [18:03:35] Seddon: ack, thanks. [18:03:40] (03CR) 10Urbanecm: [C: 03+2] Fix assessment quickview labels [extensions/MediaSearch] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730734 (https://phabricator.wikimedia.org/T292596) (owner: 10Urbanecm) [18:03:44] (03PS2) 10Dzahn: peek: replace crons with timers [puppet] - 10https://gerrit.wikimedia.org/r/730867 (https://phabricator.wikimedia.org/T273673) [18:03:45] !log addshore@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [18:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:53] urbanecm: Thank you I'll take a look just in case [18:03:59] thanks [18:04:12] Juan_90264: around? [18:04:45] (03Abandoned) 10Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: 10Andrew Bogott) [18:05:00] (03CR) 10Dzahn: [C: 03+2] dynamicproxy: remove absented cron code [puppet] - 10https://gerrit.wikimedia.org/r/726730 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [18:05:20] urbanecm: haha yeah remedial education for mw1452, a slow learner [18:05:26] (new SRE training :D) [18:05:37] urbanecm: Things look good, you can proceed [18:05:37] ah, i see :) [18:05:43] nray: thanks, syncing. 
[18:06:21] (03CR) 10Andrew Bogott: [C: 03+2] cinderutils::ensure: give more info when no device found [puppet] - 10https://gerrit.wikimedia.org/r/730005 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [18:07:17] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 41baa8c41d64510986f009b9be2d70dad0915f8c: Add new mediawiki.skin_diff event logging stream (T289622) (duration: 01m 05s) [18:07:23] nray: should be live, enjoy! [18:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:24] T289622: Create new stream to log events from VectorPrefDiffInstrumentation - https://phabricator.wikimedia.org/T289622 [18:07:31] thank you urbanecm ! [18:07:34] any time [18:07:44] waiting for CI to finish to go with Seddon's patch [18:08:23] (03PS1) 10Urbanecm: Enable Growth's mentor dashboard backend on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730868 (https://phabricator.wikimedia.org/T278920) [18:10:45] Sorry I'm late, I arrived after a setback [18:11:25] (03CR) 10Urbanecm: [C: 03+2] Enable Growth's mentor dashboard backend on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730868 (https://phabricator.wikimedia.org/T278920) (owner: 10Urbanecm) [18:11:33] urbanecm:now i am present [18:11:38] let me push this patch first, and then I'll work on your patches Juan_90264 [18:12:16] Okay [18:12:19] (03Merged) 10jenkins-bot: Enable Growth's mentor dashboard backend on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730868 (https://phabricator.wikimedia.org/T278920) (owner: 10Urbanecm) [18:12:59] Juan_90264: any reason why you used 104/105 for the ID? 
Usually, they start from 100, so that's why i'm asking [18:13:03] not an issue, just wondering [18:13:12] (03CR) 10Volans: "replies inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/730440 (owner: 10Jbond) [18:14:44] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 262e588b44f126fb9e1aa933a3ca59b191b42bd7: Enable Growth mentor dashboard backend on all wikis (T278920) (duration: 01m 05s) [18:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:51] T278920: Mentor dashboard: V1 desktop - https://phabricator.wikimedia.org/T278920 [18:16:01] (03PS1) 10Dzahn: miscweb: downgrade staging to 2021-10-12-182149-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730869 [18:16:25] (03PS10) 10Urbanecm: Add $wgSitename and $wgMetaNamespace for kswiki and kswiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720320 (https://phabricator.wikimedia.org/T289752) (owner: 10Rishabhbhat) [18:16:27] Urbanecm: I used 104/105 to leave 100/101 vacant in case they add the Portal namespace, and leave 102/103 vacant in case a WikiProject namespace [18:16:30] (03CR) 10Legoktm: [C: 03+2] growthexperiments: Run refreshLinkRecommendations in parallel [puppet] - 10https://gerrit.wikimedia.org/r/730752 (https://phabricator.wikimedia.org/T278103) (owner: 10Urbanecm) [18:16:35] (03CR) 10Urbanecm: [C: 03+2] Add $wgSitename and $wgMetaNamespace for kswiki and kswiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720320 (https://phabricator.wikimedia.org/T289752) (owner: 10Rishabhbhat) [18:16:41] Juan_90264: ack, makes sense [18:16:45] thanks [18:17:13] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [18:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:51] (03Merged) 10jenkins-bot: Add $wgSitename and $wgMetaNamespace for kswiki and kswiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720320 
(https://phabricator.wikimedia.org/T289752) (owner: 10Rishabhbhat) [18:19:06] (03CR) 10Dzahn: [C: 03+2] "trying this to see if i still run into the 2m timeouts when pulling from registry" [deployment-charts] - 10https://gerrit.wikimedia.org/r/730869 (owner: 10Dzahn) [18:19:26] Juan_90264: your first patch is at mwdebug1001, can you test please? [18:20:26] urbanecm: should I stop the currently running mediawiki_job_growthexperiments-refreshLinkRecommendations.service ? [18:20:34] legoktm: yes please. [18:20:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:21:04] done [18:21:08] thanks [18:23:03] urbanecm: I tested and approved [18:23:24] (03Merged) 10jenkins-bot: Fix assessment quickview labels [extensions/MediaSearch] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730734 (https://phabricator.wikimedia.org/T292596) (owner: 10Urbanecm) [18:23:29] thanks Juan_90264 [18:24:13] (03Merged) 10jenkins-bot: miscweb: downgrade staging to 2021-10-12-182149-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730869 (owner: 10Dzahn) [18:25:43] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:25:44] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 0bccd4bc45498db8628567574d0bb3a23f8fb378: Add $wgSitename and $wgMetaNamespace for kswiki and kswiktionary (T289752, T289767) (duration: 01m 04s) [18:25:49] Juan_90264: first patch is ready [18:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:51] T289767: Change name and Wiktionary: namespace of Kashmiri Wiktionary - https://phabricator.wikimedia.org/T289767 [18:25:51] T289752: Change Ks Wikipedia sitename (wgMetaNamespace) from Wikipedia to 
وِکیٖپیٖڈیا - https://phabricator.wikimedia.org/T289752 [18:25:57] Okay [18:27:05] Seddon: your patch is at mwdebug1001, can you test please? [18:27:11] (03PS4) 10Urbanecm: Create Salima namespace for dagwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730579 (https://phabricator.wikimedia.org/T289911) (owner: 10Juan90264) [18:27:17] (03CR) 10Urbanecm: [C: 03+2] Create Salima namespace for dagwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730579 (https://phabricator.wikimedia.org/T289911) (owner: 10Juan90264) [18:28:14] (03Merged) 10jenkins-bot: Create Salima namespace for dagwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730579 (https://phabricator.wikimedia.org/T289911) (owner: 10Juan90264) [18:28:38] 10SRE, 10Wikimedia-Mailing-lists, 10cloud-services-team (Kanban): auto-subscribe cloud-vps and/or toolforge users to cloud-announce - https://phabricator.wikimedia.org/T278361 (10Legoktm) >>! In T278361#6962981, @Legoktm wrote: > Filed {T279023}. I would like to see what access control options exist on the m... [18:30:30] !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . [18:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:05] Seddon: how is it going? [18:32:25] Juan_90264: your second patch is at mwdebug1001 too, can you test? [18:32:47] Yes, i can [18:34:30] urbanecm: testing [18:34:44] thanks, let me know how it goes [18:35:20] urbanecm: I tested and approved [18:35:36] @urbanecm confirmed! Good to go [18:35:40] thanks both! syncing. [18:36:57] global renames are stuck (T293403), could someone take a look at that. 
[18:36:58] T293403: Global renames are stuck - https://phabricator.wikimedia.org/T293403 [18:37:06] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: c8dffefd0d095abe3709dcc962d5d24f27b55869: Create Salima namespace for dagwiki (T289911) (duration: 01m 04s) [18:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:12] T289911: Create Salima namespace at dag.wikipedia.org - https://phabricator.wikimedia.org/T289911 [18:37:15] zabe: I will do so once i finish B&C [18:37:27] ok, thx [18:40:02] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.4/extensions/MediaSearch/extension.json: 6da3523daaba85a4199721980c0a9c96b20697e7: Fix assessment quickview labels (T292596) (duration: 01m 03s) [18:40:03] Seddon: should be live. Anything else? [18:40:06] (03PS1) 10Dzahn: miscweb: downgrade staging to 2021-09-03-124355-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730876 [18:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:08] T292596: [S] [wmf.2] MediaSearch QuickView: MediaWiki names labels displayed for community assessment message text - https://phabricator.wikimedia.org/T292596 [18:40:14] (03CR) 10jerkins-bot: [V: 04-1] miscweb: downgrade staging to 2021-09-03-124355-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730876 (owner: 10Dzahn) [18:41:13] urbanecm: nope all good [18:41:17] great" [18:41:32] !log UTC evening B&C done [18:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:49] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=metawiki --logwiki=metawiki 'George Dum Fulton' 'George Fulton' # T293403 [18:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:56] T293403: Global renames are stuck - https://phabricator.wikimedia.org/T293403 [18:43:41] urbanecm: I think the rename thing is a job queue issue 
[18:43:48] (03PS2) 10Dzahn: miscweb: downgrade staging to 2021-09-03-124355-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730876 [18:44:00] majavah: I wouldn't be so sure about that, because of https://login.wikimedia.org/wiki/Special:GlobalRenameProgress/%C7%B1oo [18:44:06] I'm going to reopen T219279 [18:44:07] T219279: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 [18:44:43] ah, I was going with T293385 which looked related [18:44:44] T293385: Several jobs (incl. recentChangesUpdate, wikibase-InjectRCRecords) accumulating backlog since 2021-10-14 14:47 UTC - https://phabricator.wikimedia.org/T293385 [18:44:51] or that [18:45:02] 10SRE, 10MediaWiki-General, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.28; 2020-04-14), and 5 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Urbanecm) 05Resolved→03Open >>! In T219279#7403482, @Pchelolo wrote:... [18:45:12] Pchelolo: ^^ [18:47:18] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=frwiktionary --logwiki=metawiki 'TURK FASTER' 'ARTHUR MORGAN' [18:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:29] Urbanecm: Both working, thanks! [18:47:57] (03CR) 10Dzahn: [C: 03+2] miscweb: downgrade staging to 2021-09-03-124355-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730876 (owner: 10Dzahn) [18:48:04] majavah: i think you're right. I kicked TRUK FASTER rename, one rename ran quickly, but the other ones are stalled. 
[18:48:43] i'll kick them all, the task is closed, so...hopefully should be fine [18:48:57] (03PS1) 10Andrew Bogott: codfw1dev.wikimediacloud.org: Add new hostnames for tls openstack endpoints [dns] - 10https://gerrit.wikimedia.org/r/730879 (https://phabricator.wikimedia.org/T267194) [18:49:18] (03CR) 10Andrew Bogott: [C: 03+1] "associated dns patch is https://gerrit.wikimedia.org/r/c/operations/dns/+/730879" [puppet] - 10https://gerrit.wikimedia.org/r/728260 (https://phabricator.wikimedia.org/T267194) (owner: 10Majavah) [18:50:01] urbanecm: afaik it's closed because addshore worked around it for the jobs he needed, not because the cause was fixed [18:50:08] aha [18:50:21] The jobqueue issue would explain the fact they were not completly stucked. Some of them did got renamed in some wikis very slowly. [18:50:27] as far as we could tell the cause was just lots of jobs, and this should be expected for the low traffic jobs [18:50:45] so "just wait" is the ideal solution here addshore ? [18:50:52] and the job we had with our recent refactorings we decided should have been in the higher queue [18:50:53] do you know where the "lots of jobs" came from? [18:51:02] *looks at the boards again* [18:51:51] (03Merged) 10jenkins-bot: miscweb: downgrade staging to 2021-09-03-124355-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730876 (owner: 10Dzahn) [18:52:41] (03CR) 10Andrew Bogott: [C: 03+1] "One important thing to note here -- this patch will only provide an endpoint for the 'public' and 'internal' keystone endpoints. 
There's a" [puppet] - 10https://gerrit.wikimedia.org/r/728260 (https://phabricator.wikimedia.org/T267194) (owner: 10Majavah) [18:53:11] !log [urbanecm@mwmaint1002 ~]$ mwscript namespaceDupes.php --wiki=dagwiki --fix [18:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:23] GlobalUserPageLocalJobSubmitJob (low-traffic-jobs) looks suspicious [18:53:23] !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . [18:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:47] mean runtime since 14:57 has been 5 mins [18:53:54] addshore: mostly if we have tons of LocalGlobalUserPageCacheUpdateJob coming in and we don't know why, relying on them to stop at some point might not be enough [18:54:21] yeah, that looks like it would be clogging up the low traffic job queue [18:54:50] https://github.com/wikimedia/mediawiki-extensions-GlobalUserPage/commit/3376b1906a0faad9bd1e299a7cf45a81c900c6a5 might be related [18:55:00] per https://gerrit.wikimedia.org/g/mediawiki/extensions/GlobalUserPage/+/3376b1906a0faad9bd1e299a7cf45a81c900c6a5/includes/Hooks.php#84, it runs on link updates and page edits [18:55:12] oh yeah, if that went out with the train this week that could be it [18:55:22] cc legoktm ^ [18:55:42] actually that one was backported to wmf.3 too [18:56:02] GlobalUserPageLocalJobSubmitJob jobs get queued as pages get edited [18:56:35] that job has a 3.5 hour or so backlog currently [18:57:29] (03PS1) 10Dzahn: miscweb: upgrade staging to 2021-10-12-182149-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730882 [18:57:40] also all that job does is queue more jobs [18:57:52] addshore: link to a dashboard? [18:57:57] legoktm: or on https://gerrit.wikimedia.org/g/mediawiki/core/+/df9a903d494084b4a34e0b30270c0d0c5bf6cd6e/includes/deferred/LinksUpdate.php#174, which...likely happens on nulledits too? 
[18:58:08] https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=GlobalUserPageLocalJobSubmitJob [18:58:20] this one is 2.2 hour lag, one of the jobs i guess it queues is 3.5 [18:58:38] but yeah, these are the jobs clogging up the low traffic stuf [18:58:47] (03PS2) 10Dzahn: miscweb: upgrade staging to 2021-10-12-182149-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730882 [18:58:50] (03CR) 10Dzahn: [C: 03+2] miscweb: upgrade staging to 2021-10-12-182149-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730882 (owner: 10Dzahn) [18:59:14] but GlobalUserPageLocalJobSubmitJob mean runtime is 5 mins? [18:59:28] if all it does is submit other jobs, that reminds me of a ticket I have seen quite a lot recently [18:59:36] wait, each job is taking 5 minutes to run??? [18:59:41] it should take like, 10 seconds [18:59:49] *finds the ticket* [19:00:02] https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?viewPanel=3&orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=GlobalUserPageLocalJobSubmitJob&from=now-7d&to=now [19:00:05] dancy and brennen: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211014T1900). [19:00:06] https://phabricator.wikimedia.org/T292048 ? 
[19:00:06] seems like a recent regression [19:01:01] also, the flat 5 min seems more like a timeout being hit [19:01:08] yup [19:01:35] https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?viewPanel=3&orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=GlobalUserPageLocalJobSubmitJob&from=now-30d&to=now [19:01:42] the 5 min thing was happening previously too [19:01:59] Greetings [19:02:07] urbanecm: right, basically the point of the job is to relay the LinksUpdate across all wikis [19:02:31] yeah, i was just saying that someone purging a lot could cause this in theory [19:03:19] it shouldn't still take 5 minutes (or more) to run [19:03:46] im dahsing out, but will be back in 20 mins [19:04:10] (03Merged) 10jenkins-bot: miscweb: upgrade staging to 2021-10-12-182149-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730882 (owner: 10Dzahn) [19:04:42] Hey folks. Anything going on that should block rolling train to group2? [19:05:11] ooh, I see something in #wikimedia-releng [19:05:43] !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . [19:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:27] dancy: we're looking into a job queue issue, although I don't think it's caused by this train [19:06:37] thx. [19:06:40] urbanecm: do we log manual purges somewhere? 
[19:07:17] not directly IIRC [19:07:27] you can check channel:ratelimit for purge [19:09:00] I commented at https://phabricator.wikimedia.org/T292048#7429629 <-- cc: addshore [19:09:19] (03PS1) 10Dzahn: miscweb: upgrade staging to 2021-10-12-192016-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730884 [19:09:37] (03CR) 10jerkins-bot: [V: 04-1] miscweb: upgrade staging to 2021-10-12-192016-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730884 (owner: 10Dzahn) [19:09:38] urbanecm: I'm not seeing anything from meta there [19:09:40] (03PS2) 10Dzahn: miscweb: upgrade staging to 2021-10-12-192016-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730884 [19:09:45] (03CR) 10Dzahn: [C: 03+2] miscweb: upgrade staging to 2021-10-12-192016-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730884 (owner: 10Dzahn) [19:10:47] I'm still wondering why we started only seeing issues today [19:12:53] I don't think it started today, if you look at the 30 day view it was happening previously too [19:16:32] (03Merged) 10jenkins-bot: miscweb: upgrade staging to 2021-10-12-192016-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730884 (owner: 10Dzahn) [19:20:35] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS1299/IPv4: Idle - Telia, AS1299/IPv6: Idle - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:23:13] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:23:33] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 68, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:23:50] !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . 
[19:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:59] Telia again.. hrmm [19:25:37] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:27:46] (Traffic on tunnel link) firing: Traffic on tunnel link - https://alerts.wikimedia.org [19:31:31] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:37:46] (Traffic on tunnel link) resolved: Traffic on tunnel link - https://alerts.wikimedia.org [19:40:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={redis_gitlab,sidekiq,workhorse} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:43:05] (03PS1) 10Ahmon Dancy: group2 wikis to 1.38.0-wmf.4 refs T281168 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730890 [19:43:07] (03CR) 10Ahmon Dancy: [C: 03+2] group2 wikis to 1.38.0-wmf.4 refs T281168 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730890 (owner: 10Ahmon Dancy) [19:43:19] RECOVERY - BGP status on cr3-eqsin is OK: BGP OK - up: 327, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:43:51] (03Merged) 10jenkins-bot: group2 wikis to 1.38.0-wmf.4 refs T281168 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730890 (owner: 10Ahmon Dancy) [19:44:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:45:20] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.38.0-wmf.4 refs T281168 [19:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:26] T281168: 1.38.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T281168 [20:11:32] 10SRE, 10SRE-OnFire, 10observability, 10Patch-For-Review, 10User-jbond: Automated uploads of minimal & comprehensible timeseries metrics for statuspage display - https://phabricator.wikimedia.org/T285569 (10CDanis) [20:11:58] 10SRE, 10SRE-OnFire, 10observability, 10Patch-For-Review, 10User-jbond: Automated uploads of minimal & comprehensible timeseries metrics for statuspage display - https://phabricator.wikimedia.org/T285569 (10CDanis) a:03CDanis [20:12:12] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10CAS-SSO, 10User-jbond: thanos u/i gives errors if left idle for a few hours - https://phabricator.wikimedia.org/T268233 (10Krinkle) Some partial notes from a partial investigation: While using Thanos (and presumably Grafana as well) each lo... [20:12:45] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10CAS-SSO, 10User-jbond: Thanos, Grafana, etc. break session after an hour - https://phabricator.wikimedia.org/T268233 (10Krinkle) [20:13:01] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10CAS-SSO, 10User-jbond: Thanos and Grafana lose the session after an hour - https://phabricator.wikimedia.org/T268233 (10Krinkle) [20:13:03] (03PS1) 10Cwhite: logstash: duplicate MediaWiki error,fatal,exception logs to ECS test [puppet] - 10https://gerrit.wikimedia.org/r/730897 (https://phabricator.wikimedia.org/T234565) [20:35:29] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T293053 (10CDanis) 05Open→03Resolved a:03CDanis Access granted! 
[20:38:11] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DAbad - https://phabricator.wikimedia.org/T293253 (10CDanis) @Ottomata or @odimitrijevic can you please approve? Thanks! [20:38:33] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for lbowmaker - https://phabricator.wikimedia.org/T293241 (10CDanis) @Ottomata or @odimitrijevic can you please approve? Thanks! [20:39:45] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for lbowmaker - https://phabricator.wikimedia.org/T293241 (10Ottomata) Approved. I think Luke will probably want/need shell access eventually, so if he's willing to generate an ssh key, let's get him full ssh and Kerberos too. [20:40:22] (03PS1) 10Krinkle: logging: Remove host 'ip' field [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730905 (https://phabricator.wikimedia.org/T114700) [20:40:49] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DAbad - https://phabricator.wikimedia.org/T293253 (10Ottomata) Approved. ssh keys + kerberos good. [20:41:26] (03CR) 10Krinkle: logstash: duplicate MediaWiki error,fatal,exception logs to ECS test (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/730897 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [20:42:49] (03CR) 10Krinkle: logstash: duplicate MediaWiki error,fatal,exception logs to ECS test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730897 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [20:48:39] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DAbad - https://phabricator.wikimedia.org/T293253 (10CDanis) Thanks Andrew! 
@DAbad if you want shell access, please generate and paste a public key: https://wikitech.wikimedia.org/wiki/SRE/Production_access#Generating_your_SSH_key [21:06:59] !log robh@cumin1001 START - Cookbook sre.dns.netbox [21:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:19] !log robh@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [21:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:55] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:17:59] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:20:41] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:25:07] !log robh@cumin1001 START - Cookbook sre.dns.netbox [21:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:42] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:59] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:57:36] (03PS1) 10Legoktm: [WIP] mediawiki: Disable mod_unique_id [puppet] - 10https://gerrit.wikimedia.org/r/730923 (https://phabricator.wikimedia.org/T253675) [21:58:33] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31706/console" [puppet] - 10https://gerrit.wikimedia.org/r/730923 
(https://phabricator.wikimedia.org/T253675) (owner: 10Legoktm) [22:17:40] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - ryankemper@cumin1001 - T292814 [22:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:46] T292814: Service restarts of cloudelastic for Java security updates (Aug 2021) - https://phabricator.wikimedia.org/T292814 [22:23:38] !log dpifke@deploy1002 Started deploy [performance/arc-lamp@84fe496]: New flamegraph.pl from upstream T291898 [22:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:43] !log dpifke@deploy1002 Finished deploy [performance/arc-lamp@84fe496]: New flamegraph.pl from upstream T291898 (duration: 00m 05s) [22:23:44] T291898: Arc Lamp: Update copy of FlameGraph (Support permalink to filtered view) - https://phabricator.wikimedia.org/T291898 [22:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:43] (03PS1) 10Zabe: Add americanantiquarian.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730933 (https://phabricator.wikimedia.org/T292918) [22:26:07] (03CR) 10RLazarus: "LGTM but let me get that Jenkins error sorted out first, unless this is urgent -- thanks for uncovering it, sorry for the inconvenience" [software/httpbb] - 10https://gerrit.wikimedia.org/r/640256 (owner: 10CDanis) [22:28:52] !log T288231 `ryankemper@wdqs2005:~$ sudo pool`: transfer completed successfully; tests passing on host (used `ssh -L 9999:localhost:80 wdqs2005.codfw.wmnet` to establish tunnel) [22:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:58] T288231: Deploy the wdqs streaming updater to production - https://phabricator.wikimedia.org/T288231 [22:30:26] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: enable 
the streaming updater on wdqs2006 [puppet] - 10https://gerrit.wikimedia.org/r/730795 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [22:31:10] !log depooling mw1452 for testig [22:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:47] !log T288231 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/730795; proceeding to data-transfer on `wdqs2006`: `sudo rm -fv /srv/wdqs/data_loaded` on `wdqs2006` followed by `ryankemper@cumin1001:~$ sudo cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs2006.codfw.wmnet --reason "streaming updater cutover for wdqs2005" --blazegraph_instance blazegraph --task-id T288231` [22:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:16] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [22:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:35] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [22:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:48] !log T288231 Forgot about running puppet-agent on `wdqs2006`; aborted cookbook run [22:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:00] !log T288231 Ran puppet on `wdqs2006`, now back to the cookbook run [22:35:02] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [22:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:06] T288231: Deploy the wdqs streaming updater to production - https://phabricator.wikimedia.org/T288231 [22:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_streaming_updater site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:43:44] (03Restored) 10Nray: Change VectorPrefDiffInstrumentation stream name to `mediawiki.skin_diff` [extensions/WikimediaEvents] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730733 (https://phabricator.wikimedia.org/T289622) (owner: 10Nray) [22:43:50] (03PS1) 10Zabe: allow sysops to add and remove users to other groups on ptwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730936 (https://phabricator.wikimedia.org/T292806) [22:48:08] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T253675#7430130" [puppet] - 10https://gerrit.wikimedia.org/r/730923 (https://phabricator.wikimedia.org/T253675) (owner: 10Legoktm) [22:49:35] ACKNOWLEDGEMENT - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_streaming_updater site=codfw Ryan Kemper related to https://phabricator.wikimedia.org/T288231 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:52:15] (03Abandoned) 10Legoktm: [WIP] mediawiki: Disable mod_unique_id [puppet] - 10https://gerrit.wikimedia.org/r/730923 (https://phabricator.wikimedia.org/T253675) (owner: 10Legoktm) [23:00:04] brennen: Dear deployers, time to do the UTC late backport and config training deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211014T2300). [23:00:04] zabe and nray: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:00:13] o/ [23:00:18] o/ [23:00:21] o/ [23:02:54] o/ [23:03:07] going ahead with the first config patch [23:03:45] (03CR) 10Thcipriani: [C: 03+2] "BACKPORT" [extensions/WikimediaEvents] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730733 (https://phabricator.wikimedia.org/T289622) (owner: 10Nray) [23:03:58] (03CR) 10Brennen Bearnes: [C: 03+2] Add americanantiquarian.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730933 (https://phabricator.wikimedia.org/T292918) (owner: 10Zabe) [23:04:45] (03Merged) 10jenkins-bot: Add americanantiquarian.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730933 (https://phabricator.wikimedia.org/T292918) (owner: 10Zabe) [23:06:06] zabe: want to test on mwdebug1002? [23:06:25] (03PS4) 10Juan90264: Change Kashmiri Wiktionary logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730736 (https://phabricator.wikimedia.org/T293373) [23:06:32] (03Merged) 10jenkins-bot: Change VectorPrefDiffInstrumentation stream name to `mediawiki.skin_diff` [extensions/WikimediaEvents] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730733 (https://phabricator.wikimedia.org/T289622) (owner: 10Nray) [23:07:17] not really. The only way to test this patch is by uploading an image to commons, which I don't really want to do. IMO the patch is trivial and it can be merged directly. [23:07:47] * deployed [23:09:03] you can upload a new version File:Test which was deleted in 2007. /me hides [23:09:45] !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . [23:09:50] zabe: ack, commons loads on mwdebug, syncing. 
[23:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:19] !log mw1452 - re-pooled, scap pull [23:11:19] I could upload one, but I would need to select one from that specific website and tbh I am fairly unfamiliar with the licences of those old books. [23:11:21] !log brennen@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:730933|Add americanantiquarian.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons (T292918)]] (duration: 00m 57s) [23:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:29] T292918: Add americanantiquarian.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T292918 [23:11:43] zabe: oh, I see now, testing uploadDomains, ack [23:11:45] (03PS2) 10Brennen Bearnes: allow sysops to add and remove users to other groups on ptwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730936 (https://phabricator.wikimedia.org/T292806) (owner: 10Zabe) [23:12:03] zabe: yea, good enough if it doesnt break, we will find out soon enough once people use that [23:15:50] (03PS1) 10Dzahn: miscweb: upgrade prod to 2021-10-12-192016-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730938 [23:17:43] (03CR) 10Clare Ming: [C: 03+2] allow sysops to add and remove users to other groups on ptwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730936 (https://phabricator.wikimedia.org/T292806) (owner: 10Zabe) [23:18:12] (03CR) 10Dzahn: [C: 03+2] miscweb: upgrade prod to 2021-10-12-192016-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730938 (owner: 10Dzahn) [23:18:15] But I would like to test this one on mwdebug :) [23:18:43] (03PS2) 10Cwhite: logstash: duplicate MediaWiki error,fatal,exception logs to ECS test [puppet] - 10https://gerrit.wikimedia.org/r/730897 
(https://phabricator.wikimedia.org/T234565) [23:19:37] (03PS3) 10Cwhite: logstash: duplicate MediaWiki error and exception logs to ECS test [puppet] - 10https://gerrit.wikimedia.org/r/730897 (https://phabricator.wikimedia.org/T234565) [23:19:41] (03Merged) 10jenkins-bot: allow sysops to add and remove users to other groups on ptwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730936 (https://phabricator.wikimedia.org/T292806) (owner: 10Zabe) [23:21:06] (03CR) 10Cwhite: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/730897 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [23:21:56] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - ryankemper@cumin1001 - T292814 [23:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:04] T292814: Service restarts of cloudelastic for Java security updates (Aug 2021) - https://phabricator.wikimedia.org/T292814 [23:22:10] zabe - can you test on mwdebug1002? [23:22:34] doing [23:23:32] patch works as expected [23:24:20] going live [23:24:58] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:730936|allow sysops to add and remove users to other groups on ptwikivoyage (T292806)]] (duration: 00m 56s) [23:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:04] T292806: allow sysops to add users to other groups on ptwikivoyage - https://phabricator.wikimedia.org/T292806 [23:28:16] Thanks for your help :) [23:28:58] hi nray -- is there a way to check? mwdebug1002 [23:29:25] (03PS3) 10Juan90264: Change Kashmiri Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730737 (https://phabricator.wikimedia.org/T293342) [23:29:38] cjming: hi, good to see you :).
Yeah, I'll do a quick check [23:31:48] (03CR) 10Dzahn: [V: 03+1 C: 03+2] miscweb: upgrade prod to 2021-10-12-192016-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730938 (owner: 10Dzahn) [23:32:02] cjming: things look good, you can proceed! [23:32:15] cool - syncing now [23:32:31] brennen: Do not forget me :) [23:32:54] More two patches [23:33:47] Hello? [23:34:05] !log cjming@deploy1002 Synchronized php-1.38.0-wmf.4/extensions/WikimediaEvents/includes/VectorPrefDiffInstrumentation.php: Backport: [[gerrit:730733|Change VectorPrefDiffInstrumentation stream name to `mediawiki.skin_diff` (T289622)]] (duration: 00m 56s) [23:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:12] T289622: Create new stream to log events from VectorPrefDiffInstrumentation - https://phabricator.wikimedia.org/T289622 [23:34:22] thank you cjming ! [23:34:55] Any deployers available? [23:35:19] cjming: ? [23:35:41] we're here - we'll merge your 1st patch [23:36:06] Okay [23:36:12] (03CR) 10Clare Ming: [C: 03+2] Change Kashmiri Wiktionary logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730736 (https://phabricator.wikimedia.org/T293373) (owner: 10Juan90264) [23:37:26] (03Merged) 10jenkins-bot: Change Kashmiri Wiktionary logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730736 (https://phabricator.wikimedia.org/T293373) (owner: 10Juan90264) [23:37:42] Perfect merged [23:39:11] Juan_90264: can you test on mwdebug1002? [23:39:24] Yes, i can [23:41:21] I tested and approved [23:41:27] (03CR) 10Clare Ming: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730737 (https://phabricator.wikimedia.org/T293342) (owner: 10Juan90264) [23:43:14] Juan_90264: which page are you using to test? 
[23:44:29] nevermind :) [23:45:47] (03CR) 10Dzahn: [V: 03+2 C: 03+2] miscweb: upgrade prod to 2021-10-12-192016-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730938 (owner: 10Dzahn) [23:46:21] What should I check here? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/730737/ [23:46:53] !log cjming@deploy1002 Synchronized static/images/project-logos: Config: [[gerrit:730736|Change Kashmiri Wiktionary logo (T293373)]] (duration: 00m 56s) [23:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:00] T293373: Requesting permanent logo change for ks.wiktionary.org - https://phabricator.wikimedia.org/T293373 [23:47:01] Juan_90264 syncing 730736 now [23:47:07] Okay [23:48:08] !log cjming@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:730736|Change Kashmiri Wiktionary logo (T293373)]] (duration: 00m 55s) [23:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:04] cjming: What should I check here? 
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/730737/ [23:49:16] !log cjming@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:730736|Change Kashmiri Wiktionary logo (T293373)]] (duration: 00m 55s) [23:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:53] (03PS4) 10Clare Ming: Change Kashmiri Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730737 (https://phabricator.wikimedia.org/T293342) (owner: 10Juan90264) [23:50:11] Juan_90264 rebasing 730737 now [23:50:28] Okay [23:50:54] (03CR) 10Clare Ming: [C: 03+2] Change Kashmiri Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730737 (https://phabricator.wikimedia.org/T293342) (owner: 10Juan90264) [23:52:05] (03PS2) 10Dzahn: miscweb: upgrade prod to 2021-10-12-192016-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730938 [23:52:35] (03Merged) 10jenkins-bot: Change Kashmiri Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730737 (https://phabricator.wikimedia.org/T293342) (owner: 10Juan90264) [23:53:58] Juan_90264 you can check 730737 on mwdebug1002 now [23:53:59] Great merged [23:54:05] Yes, i can [23:55:15] cjming: I tested and approved [23:55:33] cool then syncing now [23:56:57] !log cjming@deploy1002 Synchronized static/images/project-logos: Config: [[gerrit:730737|Change Kashmiri Wikipedia logo (T293342)]] (duration: 00m 56s) [23:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:04] T293342: Requesting permanent logo change for ks.wikipedia.org - https://phabricator.wikimedia.org/T293342 [23:57:27] This has not yet updated https://ks.wiktionary.org/static/images/project-logos/kswiktionary.png [23:57:37] still syncing [23:58:01] !log cjming@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:730737|Change Kashmiri Wikipedia logo (T293342)]] (duration: 00m 55s) [23:58:04] oh sorry - the first one - still cached [23:58:05] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:10] (y) [23:59:10] !log cjming@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:730737|Change Kashmiri Wikipedia logo (T293342)]] (duration: 00m 55s) [23:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log