[00:57:41] (PS1) Gergő Tisza: Enable GrowthExperiments image recommendations on ar,bn,cs,vi [mediawiki-config] - https://gerrit.wikimedia.org/r/736320 (https://phabricator.wikimedia.org/T294878)
[01:12:28] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:53:14] !log milimetric@deploy1002 Started deploy [analytics/refinery@cf6095c]: Regular analytics weekly train [analytics/refinery@cf6095c]
[01:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:06:47] (Juniper alarm active) firing: Juniper alarm active - https://alerts.wikimedia.org
[02:08:50] PROBLEM - Check systemd state on wcqs2002 is CRITICAL: CRITICAL - degraded: The following units failed: session-111576.scope,user@112.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:13:18] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={GET,LIST,PATCH,PUT} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[02:13:46] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[02:14:04] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[02:14:25] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[02:15:44] !log milimetric@deploy1002 Finished deploy
[analytics/refinery@cf6095c]: Regular analytics weekly train [analytics/refinery@cf6095c] (duration: 22m 30s)
[02:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:16:14] !log milimetric@deploy1002 Started deploy [analytics/refinery@cf6095c] (thin): Regular analytics weekly train THIN [analytics/refinery@cf6095c]
[02:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:16:21] !log milimetric@deploy1002 Finished deploy [analytics/refinery@cf6095c] (thin): Regular analytics weekly train THIN [analytics/refinery@cf6095c] (duration: 00m 07s)
[02:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:16:32] !log milimetric@deploy1002 Started deploy [analytics/refinery@cf6095c] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@cf6095c]
[02:16:32] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[02:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:17:34] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[02:18:18] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[02:20:06] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[02:22:08] !log milimetric@deploy1002 Finished deploy [analytics/refinery@cf6095c] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@cf6095c] (duration: 05m 36s)
[02:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:24:05] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[03:06:28] SRE, Wikimedia-Mailing-lists: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066 (Legoktm) p:Low→Medium Recent events have made it so that we should probably do this sooner instead of waiting. The one catch is that mail delivery is dependent upon the web...
[04:24:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service,monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:56:49] In about 1h we'll switchover s1 (enwiki) master
[05:01:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 32 hosts with reason: Primary switchover s1 T293964
[05:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:01:22] T293964: Switchover s1 from db1163 to db1118 - https://phabricator.wikimedia.org/T293964
[05:01:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s1 T293964
[05:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:52] (PS3) Marostegui: mariadb: Promote db1118 to s1 master [puppet] - https://gerrit.wikimedia.org/r/736114 (https://phabricator.wikimedia.org/T293964)
[05:29:20] (CR) Kormat: [C: +1] mariadb: Promote db1118 to s1 master [puppet] - https://gerrit.wikimedia.org/r/736114 (https://phabricator.wikimedia.org/T293964) (owner: Marostegui)
[05:30:04] (CR) Kormat: [C: +1] wmnet: Update s1-master CNAME [dns] - https://gerrit.wikimedia.org/r/736115 (https://phabricator.wikimedia.org/T293964) (owner: Marostegui)
[05:33:14] (CR) Marostegui: [C: +2] mariadb: Promote db1118 to s1 master [puppet] - https://gerrit.wikimedia.org/r/736114 (https://phabricator.wikimedia.org/T293964) (owner: Marostegui)
[05:33:42] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (phab1001), Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:46:11] In about 15 minutes we'll switchover s1 (enwiki) master
[06:00:05] kormat and marostegui: #bothumor I ♥ Unicode.
All rise for Database primary switchover for s1 deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211103T0600).
[06:00:11] \o/
[06:00:14] Let's go?
[06:00:52] * kormat sighs in resignation
[06:00:58] !log Starting s1 eqiad failover from db1163 to db1118 - T293964
[06:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:01] T293964: Switchover s1 from db1163 to db1118 - https://phabricator.wikimedia.org/T293964
[06:01:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s1 eqiad as read-only for maintenance - T293964', diff saved to https://phabricator.wikimedia.org/P17657 and previous config saved to /var/cache/conftool/dbconfig/20211103-060114-root.json
[06:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:24] RO confirmed
[06:02:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1118 to s1 primary and set section read-write T293964', diff saved to https://phabricator.wikimedia.org/P17658 and previous config saved to /var/cache/conftool/dbconfig/20211103-060201-root.json
[06:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:05] All done
[06:02:07] Checking now
[06:02:40] I can write fine
[06:03:14] orchestrator needs the usual pt-heartbeat clean up
[06:03:18] but other than that I think we are ok
[06:03:25] i can handle that
[06:03:30] thanks :*
[06:04:23] I can see recentchanges increasing
[06:04:24] done
[06:05:12] merging dns
[06:05:15] (CR) Marostegui: [C: +2] wmnet: Update s1-master CNAME [dns] - https://gerrit.wikimedia.org/r/736115 (https://phabricator.wikimedia.org/T293964) (owner: Marostegui)
[06:06:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1163 until it's reimaged to buster T293964', diff saved to https://phabricator.wikimedia.org/P17659 and previous config saved to /var/cache/conftool/dbconfig/20211103-060644-root.json
[06:06:47] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:47] (Juniper alarm active) firing: Juniper alarm active - https://alerts.wikimedia.org
[06:06:48] T293964: Switchover s1 from db1163 to db1118 - https://phabricator.wikimedia.org/T293964
[06:08:42] 48 seconds of RO time
[06:09:33] i guess that'll have to do
[06:09:41] noice
[06:10:05] And only one more left :)
[06:10:13] marostegui: you might want to stop replication on db1163
[06:10:16] (or i can)
[06:10:18] !log Stop replication on db1163 T290865
[06:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:21] kormat: just did!
[06:10:21] T290865: Upgrade s1 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T290865
[06:10:22] :D
[06:11:28] (PS1) Marostegui: db1163: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/736328 (https://phabricator.wikimedia.org/T290865)
[06:12:02] (CR) Marostegui: [C: +2] db1163: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/736328 (https://phabricator.wikimedia.org/T290865) (owner: Marostegui)
[06:13:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1163.eqiad.wmnet with OS buster
[06:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:17] (PS3) Marostegui: dbbackups: Switch s1 backup generation from db1139 to db1140 [puppet] - https://gerrit.wikimedia.org/r/721286 (https://phabricator.wikimedia.org/T290865) (owner: Jcrespo)
[06:21:55] (CR) Marostegui: "Merging this as the switchover happened and s1 snapshot happened last night, so perfect timing!"
[puppet] - https://gerrit.wikimedia.org/r/721286 (https://phabricator.wikimedia.org/T290865) (owner: Jcrespo)
[06:22:16] (CR) Marostegui: [C: +2] dbbackups: Switch s1 backup generation from db1139 to db1140 [puppet] - https://gerrit.wikimedia.org/r/721286 (https://phabricator.wikimedia.org/T290865) (owner: Jcrespo)
[06:26:41] PROBLEM - puppet last run on wcqs2002 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:32:19] (PS1) Urbanecm: Growth IP research survey: Fix coverage [mediawiki-config] - https://gerrit.wikimedia.org/r/736332 (https://phabricator.wikimedia.org/T294568)
[06:32:30] jouncebot: nowandnext
[06:32:31] No deployments scheduled for the next 4 hour(s) and 27 minute(s)
[06:32:31] In 4 hour(s) and 27 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211103T1100)
[06:32:39] (CR) Urbanecm: [C: +2] Growth IP research survey: Fix coverage [mediawiki-config] - https://gerrit.wikimedia.org/r/736332 (https://phabricator.wikimedia.org/T294568) (owner: Urbanecm)
[06:33:25] (Merged) jenkins-bot: Growth IP research survey: Fix coverage [mediawiki-config] - https://gerrit.wikimedia.org/r/736332 (https://phabricator.wikimedia.org/T294568) (owner: Urbanecm)
[06:35:09] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 34888b034e54ec35ca3b6745336fc0881e50c9b0: Growth IP research survey: Fix coverage (T294568) (duration: 01m 04s)
[06:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:13] T294568: deploy quicksurvey for editors on eswiki and arwiki (for Growth IP editors research) - https://phabricator.wikimedia.org/T294568
[06:35:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[06:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[06:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1163.eqiad.wmnet with OS buster
[06:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:17] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:46:01] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:46:35] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[06:48:21] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[07:26:45] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 233, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:28:23] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:37:15] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:38:49] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:41:01] ops-ulsfo: ps1-22-ulsfo Cord, Master_Cord_A, Active Power alerting - https://phabricator.wikimedia.org/T294891 (ayounsi) p:Triage→Medium
[07:50:19] !log Drop oauth2_access_tokens oauth_accepted_consumer oauth_registered_consumer from foundationwiki T294595
[07:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:23] T294595: Drop OAuth-related tables from foundationwiki - https://phabricator.wikimedia.org/T294595
[07:51:40] (CR) Elukey: Add role::analytics_cluster::database::meta on an-db100[12] (1 comment) [puppet] - https://gerrit.wikimedia.org/r/736019 (https://phabricator.wikimedia.org/T284150) (owner: Ottomata)
[07:57:59] !log elukey@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[07:57:59] !log elukey@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
[07:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:00] ACKNOWLEDGEMENT - OSPF status on mr1-codfw is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP ayounsi https://phabricator.wikimedia.org/T294789 https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove logpager replicas from s6 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P17660 and previous config saved to /var/cache/conftool/dbconfig/20211103-075801-marostegui.json
[07:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:04] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127
[08:21:47] (Juniper alarm active) resolved: Juniper alarm active - https://alerts.wikimedia.org
[08:39:15] (PS7) Jbond: C:puppetmaster::gitclone: add support for r10k environments [puppet] - https://gerrit.wikimedia.org/r/736305
[08:40:02] (CR) jerkins-bot: [V: -1] C:puppetmaster::gitclone: add support for r10k environments [puppet] - https://gerrit.wikimedia.org/r/736305 (owner: Jbond)
[08:41:28] (PS8) Jbond: C:puppetmaster::gitclone: add support for r10k environments [puppet] - https://gerrit.wikimedia.org/r/736305
[08:42:31] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32070/console" [puppet] - https://gerrit.wikimedia.org/r/736305 (owner: Jbond)
[08:43:59] (PS9) Jbond: C:puppetmaster::gitclone: add support for r10k environments [puppet] - https://gerrit.wikimedia.org/r/736305
[08:44:55] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32072/console" [puppet] - https://gerrit.wikimedia.org/r/736305 (owner: Jbond)
[08:46:16] (CR) Ayounsi: "NOOP on dns1001 and
centrallog as well https://puppet-compiler.wmflabs.org/compiler1001/32071/" [puppet] - https://gerrit.wikimedia.org/r/735410 (owner: Ayounsi)
[08:50:40] (PS1) Kormat: WIP: mariadb: Set host monitoring to critical. [puppet] - https://gerrit.wikimedia.org/r/736415
[08:52:07] (PS2) Kormat: WIP: mariadb: Set host monitoring to critical. [puppet] - https://gerrit.wikimedia.org/r/736415
[08:52:12] (CR) Jelto: [C: +1] charts: bump common_templates to 0.4 and chart versions (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/736227 (https://phabricator.wikimedia.org/T292390) (owner: Jelto)
[08:53:41] (CR) Kormat: [V: +1] "PCC SUCCESS (NOOP 2 DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32074/console" [puppet] - https://gerrit.wikimedia.org/r/736415 (owner: Kormat)
[08:55:15] !log Disable eqiad Equinix IXP peerings - T290877
[08:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:46] (PS3) Kormat: WIP: mariadb: Set core host monitoring to critical.
[puppet] - https://gerrit.wikimedia.org/r/736415
[08:59:48] (PS10) Jbond: C:puppetmaster::gitclone: add support for r10k environments [puppet] - https://gerrit.wikimedia.org/r/736305
[09:00:27] (CR) jerkins-bot: [V: -1] C:puppetmaster::gitclone: add support for r10k environments [puppet] - https://gerrit.wikimedia.org/r/736305 (owner: Jbond)
[09:02:11] (CR) Kormat: [V: +1] "PCC SUCCESS (DIFF 5 NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32077/console" [puppet] - https://gerrit.wikimedia.org/r/736415 (owner: Kormat)
[09:02:24] (PS2) Jelto: charts: bump common_templates to 0.4 and chart versions [deployment-charts] - https://gerrit.wikimedia.org/r/736227 (https://phabricator.wikimedia.org/T292390)
[09:02:33] (CR) jerkins-bot: [V: -1] charts: bump common_templates to 0.4 and chart versions [deployment-charts] - https://gerrit.wikimedia.org/r/736227 (https://phabricator.wikimedia.org/T292390) (owner: Jelto)
[09:03:14] (PS3) Jelto: services: add support to deploy all services with helm3 [deployment-charts] - https://gerrit.wikimedia.org/r/735979 (https://phabricator.wikimedia.org/T251305)
[09:03:52] (PS11) Jbond: C:puppetmaster::gitclone: add support for r10k environments [puppet] - https://gerrit.wikimedia.org/r/736305
[09:04:25] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32080/console" [puppet] - https://gerrit.wikimedia.org/r/736305 (owner: Jbond)
[09:04:51] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32081/console" [puppet] - https://gerrit.wikimedia.org/r/736305 (owner: Jbond)
[09:05:59] (PS4) Kormat: mariadb: Set core db host monitoring to critical.
[puppet] - https://gerrit.wikimedia.org/r/736415 (https://phabricator.wikimedia.org/T233684)
[09:17:10] (CR) Jbond: [V: +1 C: +2] C:puppetmaster::gitclone: add support for r10k environments [puppet] - https://gerrit.wikimedia.org/r/736305 (owner: Jbond)
[09:23:54] !log re-enable eqiad Equinix IXP peerings - T290877
[09:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:14] (PS1) Jbond: puppetmaster::r10k: check for an actual file [puppet] - https://gerrit.wikimedia.org/r/736419
[09:29:38] (CR) Jbond: [C: +2] puppetmaster::r10k: check for an actual file [puppet] - https://gerrit.wikimedia.org/r/736419 (owner: Jbond)
[09:42:46] (PS1) Vgutierrez: Release 8.0.8-1wm5 [debs/trafficserver] - https://gerrit.wikimedia.org/r/736420 (https://phabricator.wikimedia.org/T294897)
[09:53:20] (PS3) Giuseppe Lavagetto: php: allow installing multiple php versions at the same time [puppet] - https://gerrit.wikimedia.org/r/736276 (https://phabricator.wikimedia.org/T293450)
[09:55:58] (PS2) Ladsgroup: admin: Add my new ssh key [puppet] - https://gerrit.wikimedia.org/r/736311
[10:01:25] (PS1) Jbond: C:puppetmaster: don't create hiera.yaml file for r10k users [puppet] - https://gerrit.wikimedia.org/r/736422
[10:02:03] (CR) jerkins-bot: [V: -1] C:puppetmaster: don't create hiera.yaml file for r10k users [puppet] - https://gerrit.wikimedia.org/r/736422 (owner: Jbond)
[10:03:39] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32085/console" [puppet] - https://gerrit.wikimedia.org/r/736422 (owner: Jbond)
[10:04:09] (PS2) Jbond: C:puppetmaster: don't create hiera.yaml file for r10k users [puppet] - https://gerrit.wikimedia.org/r/736422
[10:04:58] (PS3) Jbond: C:puppetmaster: don't create hiera.yaml file for r10k users [puppet] - https://gerrit.wikimedia.org/r/736422
[10:06:05] (CR) Jbond: [V: +1]
"PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32087/console" [puppet] - https://gerrit.wikimedia.org/r/736422 (owner: Jbond)
[10:06:34] (CR) Jbond: [V: +1 C: +2] C:puppetmaster: don't create hiera.yaml file for r10k users [puppet] - https://gerrit.wikimedia.org/r/736422 (owner: Jbond)
[10:08:18] (CR) Muehlenhoff: [C: +1] "Verified the key out of band via Slack, merging." [puppet] - https://gerrit.wikimedia.org/r/736311 (owner: Ladsgroup)
[10:08:23] (CR) Muehlenhoff: [C: +2] admin: Add my new ssh key [puppet] - https://gerrit.wikimedia.org/r/736311 (owner: Ladsgroup)
[10:11:51] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM, thanks!" [puppet] - https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: Arturo Borrero Gonzalez)
[10:12:53] (PS1) Jbond: r10k: create r10k production environment [puppet] - https://gerrit.wikimedia.org/r/736425
[10:14:05] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:16:43] (CR) Arturo Borrero Gonzalez: [C: +1] add the Wikimedia Enterprise content downloader script [puppet] - https://gerrit.wikimedia.org/r/734622 (https://phabricator.wikimedia.org/T273585) (owner: ArielGlenn)
[10:18:17] the BFD alert is related to the Lumen transport link afaics
[10:20:25] yeah seems to be
[10:20:42] oddly I can ping the other side
[10:22:52] BFD showing up again after "clear bfd session address " either side
[10:22:57] (CR) Arturo Borrero Gonzalez: "> Patch Set 2: Verified+1" [puppet] - https://gerrit.wikimedia.org/r/736201 (owner: David Caro)
[10:23:09] ¯\_(ツ)_/¯
[10:24:03] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:25:15] topranks: it happened to me too a while ago, the problem was between eqiad and esams IIRC (with a
clear session everything went back to normal)
[10:25:52] yeah it's funny. I can see OSPF adjacency is up for 11 minutes on it too, which is before I cleared the BFD
[10:26:00] (CR) Hashar: "That one should be straightforward :)" [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/734965 (owner: Hashar)
[10:28:32] (PS3) Jelto: charts: bump common_templates to 0.4 and chart versions [deployment-charts] - https://gerrit.wikimedia.org/r/736227 (https://phabricator.wikimedia.org/T292390)
[10:29:46] (CR) Muehlenhoff: "One typo inline, the four included patches look good to me otherwise." [debs/trafficserver] - https://gerrit.wikimedia.org/r/736420 (https://phabricator.wikimedia.org/T294897) (owner: Vgutierrez)
[10:31:26] (CR) Arturo Borrero Gonzalez: [C: +2] wikireplicas: add Translate extension tables [puppet] - https://gerrit.wikimedia.org/r/735088 (https://phabricator.wikimedia.org/T289952) (owner: AntiCompositeNumber)
[10:32:53] (Traffic bill over quota) firing: (2) Traffic bill over quota - https://alerts.wikimedia.org
[10:37:41] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wcqs_443: Servers wcqs2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:37:53] (Traffic bill over quota) firing: (3) Traffic bill over quota - https://alerts.wikimedia.org
[10:37:57] PROBLEM - puppet last run on wcqs1001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[10:38:55] (PS4) Giuseppe Lavagetto: php: allow installing multiple php versions at the same time [puppet] - https://gerrit.wikimedia.org/r/736276 (https://phabricator.wikimedia.org/T293450)
[10:38:57] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wcqs_443: Servers wcqs2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:39:34] (CR) Arturo Borrero
Gonzalez: toolforge: new add_grid_webgrid_generic_node recipe (7 comments) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/726894 (https://phabricator.wikimedia.org/T292465) (owner: David Caro)
[10:40:55] (PS17) David Caro: ceph: introduce auth load abstraction [puppet] - https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: Arturo Borrero Gonzalez)
[10:42:36] (PS6) David Caro: DONOTMERGE toolforge: new add_grid_webgrid_generic_node recipe [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/726894 (https://phabricator.wikimedia.org/T292465)
[10:43:25] (CR) David Caro: [V: +1] ceph: introduce auth load abstraction (1 comment) [puppet] - https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: Arturo Borrero Gonzalez)
[10:45:11] PROBLEM - SSH on wcqs2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:48:24] (PS2) Vgutierrez: Release 8.0.8-1wm5 [debs/trafficserver] - https://gerrit.wikimedia.org/r/736420 (https://phabricator.wikimedia.org/T294897)
[10:48:37] (CR) Vgutierrez: Release 8.0.8-1wm5 (2 comments) [debs/trafficserver] - https://gerrit.wikimedia.org/r/736420 (https://phabricator.wikimedia.org/T294897) (owner: Vgutierrez)
[10:48:57] (PS1) David Caro: ceph::auth: moved config to load_all.yaml [labs/private] - https://gerrit.wikimedia.org/r/736435
[10:49:15] (CR) David Caro: [V: +2 C: +2] ceph::auth: moved config to load_all.yaml [labs/private] - https://gerrit.wikimedia.org/r/736435 (owner: David Caro)
[10:50:31] (PS1) Jbond: O:puppetmaster:r10k: make config more flexible [puppet] - https://gerrit.wikimedia.org/r/736436
[10:51:04] (CR) jerkins-bot: [V: -1] O:puppetmaster:r10k: make config more flexible [puppet] - https://gerrit.wikimedia.org/r/736436 (owner: Jbond)
[10:51:10] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1):
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32092/console" [puppet] - https://gerrit.wikimedia.org/r/736436 (owner: Jbond)
[10:52:44] (CR) Elukey: "Adding Ben for the AQS part (DE is currently working on migrating Cassandra to 3.x in a new cluster)" [puppet] - https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) (owner: Hnowlan)
[10:52:53] (Traffic bill over quota) firing: (3) Traffic bill over quota - https://alerts.wikimedia.org
[10:53:32] (PS2) Jbond: O:puppetmaster:r10k: make config more flexible [puppet] - https://gerrit.wikimedia.org/r/736436
[10:54:48] (PS5) ArielGlenn: add the Wikimedia Enterprise content downloader script [puppet] - https://gerrit.wikimedia.org/r/734622 (https://phabricator.wikimedia.org/T273585)
[10:55:55] (CR) David Caro: [C: +2] "All changes applied now, good to go, will have to sort out the deploy part of it later, and add the actual existing keyrings too." [puppet] - https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: Arturo Borrero Gonzalez)
[10:56:55] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4355 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[10:57:53] (Traffic bill over quota) resolved: Traffic bill over quota - https://alerts.wikimedia.org
[10:58:55] (PS1) Ladsgroup: admin: Fix the prefix [puppet] - https://gerrit.wikimedia.org/r/736437
[10:58:55] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.08065 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[10:59:01] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1):
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32093/console" [puppet] - https://gerrit.wikimedia.org/r/736436 (owner: Jbond)
[10:59:44] (CR) Jbond: [V: +1 C: +2] O:puppetmaster:r10k: make config more flexible [puppet] - https://gerrit.wikimedia.org/r/736436 (owner: Jbond)
[11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211103T1100).
[11:00:05] inductiveload: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:09] (CR) David Caro: [C: +2] ceph: introduce auth load abstraction (1 comment) [puppet] - https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: Arturo Borrero Gonzalez)
[11:01:15] inductiveload: hey, around?
[11:01:31] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/736437 (owner: Ladsgroup)
[11:01:40] (CR) Muehlenhoff: "After some more digging "CVE-2021-38161 Not validating origin TLS certificate" appears to be" [debs/trafficserver] - https://gerrit.wikimedia.org/r/736420 (https://phabricator.wikimedia.org/T294897) (owner: Vgutierrez)
[11:01:41] hello :-)
[11:01:59] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:02:41] (CR) ArielGlenn: [C: +2] add the Wikimedia Enterprise content downloader script [puppet] - https://gerrit.wikimedia.org/r/734622 (https://phabricator.wikimedia.org/T273585) (owner: ArielGlenn)
[11:02:45] In that case…
[11:02:49] I can deploy today
[11:03:35] I have a file to transfer to test
[11:04:03] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 32, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:04:44] (PS3) Urbanecm: Wikisource: allow copy-uploads from Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/736215 (https://phabricator.wikimedia.org/T294824) (owner: Inductiveload)
[11:05:09] (CR) Urbanecm: [C: +2] Wikisource: allow copy-uploads from Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/736215 (https://phabricator.wikimedia.org/T294824) (owner: Inductiveload)
[11:05:28] if you see whines about labstore1006,7 for puppet, I'm working on it. pcc liked it, puppet doesn't, prolly a typo
[11:05:31] inductiveload: that’s great, I’ll ping you when ready.
[11:05:41] thanks [11:06:13] (03Merged) 10jenkins-bot: Wikisource: allow copy-uploads from Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736215 (https://phabricator.wikimedia.org/T294824) (owner: 10Inductiveload) [11:07:47] (03PS1) 10Jbond: puppetmaster::r10k: change exec to refreshonly and subscribe [puppet] - 10https://gerrit.wikimedia.org/r/736440 [11:08:13] (03CR) 10Jbond: [C: 03+2] puppetmaster::r10k: change exec to refreshonly and subscribe [puppet] - 10https://gerrit.wikimedia.org/r/736440 (owner: 10Jbond) [11:09:16] inductiveload: please test at mwdebug1001 [11:09:31] hmm [11:09:35] (03PS1) 10ArielGlenn: fix up name of enterprise dumps downloader script [puppet] - 10https://gerrit.wikimedia.org/r/736441 (https://phabricator.wikimedia.org/T273585) [11:09:42] can one do that with PWB? [11:10:23] inductiveload: if you can convince it to pass the header...sure [11:10:28] but can't you do it via web as a one-off? [11:10:43] oh yeah of course i can [11:10:44] duh [11:10:49] good :) [11:11:04] (03CR) 10ArielGlenn: [C: 03+2] fix up name of enterprise dumps downloader script [puppet] - 10https://gerrit.wikimedia.org/r/736441 (https://phabricator.wikimedia.org/T273585) (owner: 10ArielGlenn) [11:12:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:01] urbanecm: that totally worked [11:13:13] great, syncing [11:14:19] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 7fdf3f5476d9d8ab45eb793090613e328a91bb7a: Wikisource: allow copy-uploads from Commons (T294824) (duration: 01m 04s) [11:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:24] T294824: Wikisource: allow copy-uploads from Commons - https://phabricator.wikimedia.org/T294824 [11:14:29] inductiveload: done [11:14:30] anything else? 
[11:14:41] not today, thank you ^_^ [11:15:07] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, merging." [puppet] - 10https://gerrit.wikimedia.org/r/736437 (owner: 10Ladsgroup) [11:15:09] (03CR) 10Muehlenhoff: [C: 03+2] admin: Fix the prefix [puppet] - 10https://gerrit.wikimedia.org/r/736437 (owner: 10Ladsgroup) [11:15:12] i'll surely bug you again soon though 😈 [11:15:14] no whines, problem fixed, that's it [11:15:16] hehe [11:15:20] glad i could help [11:15:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:06] (03PS3) 10Vgutierrez: Release 8.0.8-1wm5 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/736420 (https://phabricator.wikimedia.org/T294897) [11:16:07] PROBLEM - LVS wcqs codfw port 443/tcp - Wikimedia Commons Query Service IPv4 on wcqs.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [11:17:30] (03CR) 10Vgutierrez: Release 8.0.8-1wm5 (032 comments) [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/736420 (https://phabricator.wikimedia.org/T294897) (owner: 10Vgutierrez) [11:22:11] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[debs/trafficserver] - 10https://gerrit.wikimedia.org/r/736420 (https://phabricator.wikimedia.org/T294897) (owner: 10Vgutierrez) [11:23:34] (03CR) 10Vgutierrez: [C: 03+2] Release 8.0.8-1wm5 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/736420 (https://phabricator.wikimedia.org/T294897) (owner: 10Vgutierrez) [11:27:29] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements - https://phabricator.wikimedia.org/T294906 (10jbond) p:05Triage→03Medium [11:28:24] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: investigate puppet-lint-security-plugins - https://phabricator.wikimedia.org/T294907 (10jbond) [11:30:31] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements - https://phabricator.wikimedia.org/T294906 (10jbond) [11:30:37] 10Puppet, 10Infrastructure-Foundations: Admin module should use systemd-sysuser for system accounts - https://phabricator.wikimedia.org/T292965 (10jbond) [11:31:04] 10Puppet, 10Infrastructure-Foundations: Hosts distribution across puppetmasters - https://phabricator.wikimedia.org/T291541 (10jbond) [11:31:11] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements - https://phabricator.wikimedia.org/T294906 (10jbond) [11:31:19] 10Puppet, 10Infrastructure-Foundations: apt::package_from component dosn't corretlly support passing packages via a hash - https://phabricator.wikimedia.org/T291370 (10jbond) 05In progress→03Resolved [11:31:47] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements - https://phabricator.wikimedia.org/T294906 (10jbond) [11:31:53] 10Puppet, 10Infrastructure-Foundations: Puppetdb: not refreshed on config change? - https://phabricator.wikimedia.org/T291540 (10jbond) [11:31:59] (03PS1) 10Muehlenhoff: Switch ganeti-test2001 to ganeti_test role [puppet] - 10https://gerrit.wikimedia.org/r/736447 (https://phabricator.wikimedia.org/T286206) [11:32:34] (03CR) 10Ssingh: [C: 03+1] "Good to go with the NOOPs. 
(Should we update the commit message?)" [puppet] - 10https://gerrit.wikimedia.org/r/735410 (owner: 10Ayounsi) [11:32:42] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: should we move $site global to a fact - https://phabricator.wikimedia.org/T289678 (10jbond) [11:32:48] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements - https://phabricator.wikimedia.org/T294906 (10jbond) [11:33:25] (03PS6) 10Ayounsi: Bird: peer with router IP (gateway) if nothing explicitly set [puppet] - 10https://gerrit.wikimedia.org/r/735410 [11:35:58] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements - https://phabricator.wikimedia.org/T294906 (10jbond) [11:36:00] 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 3 others: Easing pain points caused by divergence between cloudservices and production puppet usecases - https://phabricator.wikimedia.org/T285539 (10jbond) [11:36:16] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements - https://phabricator.wikimedia.org/T294906 (10jbond) [11:36:23] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: puppet-rspec has trouble testing custom facts - https://phabricator.wikimedia.org/T285476 (10jbond) [11:36:25] (03CR) 10Muehlenhoff: [C: 03+2] Switch ganeti-test2001 to ganeti_test role [puppet] - 10https://gerrit.wikimedia.org/r/736447 (https://phabricator.wikimedia.org/T286206) (owner: 10Muehlenhoff) [11:36:37] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements - https://phabricator.wikimedia.org/T294906 (10jbond) [11:36:41] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Puppet does not undo manual "systemctl mask $unit" - https://phabricator.wikimedia.org/T285425 (10jbond) [11:36:55] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements - https://phabricator.wikimedia.org/T294906 (10jbond) [11:37:01] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Add 
type validation to puppetmaster::standalone - https://phabricator.wikimedia.org/T284082 (10jbond) [11:37:06] 10Puppet, 10SRE-OnFire, 10Infrastructure-Foundations, 10User-jbond: Create SRE checklist for puppet - https://phabricator.wikimedia.org/T284073 (10jbond) 05Stalled→03Resolved [11:37:22] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements - https://phabricator.wikimedia.org/T294906 (10jbond) [11:37:28] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: wmf-stylguid checks: unable to ignore violations inside roles - https://phabricator.wikimedia.org/T280353 (10jbond) [11:37:29] !log start of foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https [11:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:36] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements - https://phabricator.wikimedia.org/T294906 (10jbond) [11:37:42] 10Puppet, 10SRE, 10Infrastructure-Foundations: using the include function can trigger false positives with puppet-lint-wmf_styleguide - https://phabricator.wikimedia.org/T275387 (10jbond) [11:37:51] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements - https://phabricator.wikimedia.org/T294906 (10jbond) [11:37:54] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [11:38:09] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements - https://phabricator.wikimedia.org/T294906 (10jbond) [11:38:14] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10puppet-compiler, and 2 others: OKR: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10jbond) [11:38:25] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10puppet-compiler, and 2 others: Work required to prepare for puppet 6 - 
https://phabricator.wikimedia.org/T265138 (10jbond) [11:38:28] (03CR) 10Muehlenhoff: Add ownership annotations for WMCS services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/732307 (https://phabricator.wikimedia.org/T216088) (owner: 10Muehlenhoff) [11:38:44] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements - https://phabricator.wikimedia.org/T294906 (10jbond) [11:38:50] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Orchestrator, and 2 others: Puppet host certs do not contain Subject Alt Name entries - https://phabricator.wikimedia.org/T273637 (10jbond) [11:38:57] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements - https://phabricator.wikimedia.org/T294906 (10jbond) [11:39:01] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10jbond) [11:39:11] 10Puppet, 10Infrastructure-Foundations: puppet new facts for php_version and python_version - https://phabricator.wikimedia.org/T271196 (10jbond) p:05Triage→03Low [11:39:19] 10Puppet, 10Infrastructure-Foundations: puppet new facts for php_version and python_version - https://phabricator.wikimedia.org/T271196 (10jbond) [11:39:25] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements - https://phabricator.wikimedia.org/T294906 (10jbond) [11:39:52] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Add check for ssh key type in admin module CI - https://phabricator.wikimedia.org/T270073 (10jbond) [11:39:58] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements - https://phabricator.wikimedia.org/T294906 (10jbond) [11:40:06] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements - https://phabricator.wikimedia.org/T294906 (10jbond) [11:40:12] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: Review puppetmaster SSL configueration - https://phabricator.wikimedia.org/T268040 (10jbond) 
[11:40:17] (03CR) 10Ayounsi: [C: 03+2] Bird: peer with router IP (gateway) if nothing explicitly set [puppet] - 10https://gerrit.wikimedia.org/r/735410 (owner: 10Ayounsi) [11:40:29] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Puppet clean up Parent task - https://phabricator.wikimedia.org/T267395 (10jbond) 05Open→03Resolved [11:41:32] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10puppet-compiler, 10Patch-For-Review: Ensure Puppet checks types as part of the build - https://phabricator.wikimedia.org/T261693 (10jbond) 05Open→03Declined [11:42:53] (03CR) 10Ayounsi: "Confirmed NOOP on dns2001.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/735410 (owner: 10Ayounsi) [11:43:43] (03CR) 10David Caro: [C: 03+1] Add ownership annotations for WMCS services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/732307 (https://phabricator.wikimedia.org/T216088) (owner: 10Muehlenhoff) [11:45:22] (03PS1) 10Ladsgroup: admin: Adding myself to ops [puppet] - 10https://gerrit.wikimedia.org/r/736448 [11:49:10] (03PS1) 10Muehlenhoff: Reset profile::ganeti::ganeti216 to false for the role [puppet] - 10https://gerrit.wikimedia.org/r/736449 (https://phabricator.wikimedia.org/T286206) [11:50:05] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements - https://phabricator.wikimedia.org/T294906 (10jbond) [11:50:07] 10Puppet, 10Infrastructure-Foundations: wmf-style lint detects variable expansion in variables as parameter declaration - https://phabricator.wikimedia.org/T260574 (10jbond) [11:50:21] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements - https://phabricator.wikimedia.org/T294906 (10jbond) [11:50:36] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements - https://phabricator.wikimedia.org/T294906 (10jbond) [11:50:38] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: puppetmaster - ignoring invalid UTF-8 byte sequences in data to be sent to PuppetDB - 
https://phabricator.wikimedia.org/T255667 (10jbond) [11:50:51] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements - https://phabricator.wikimedia.org/T294906 (10jbond) [11:50:54] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480 (10jbond) [11:50:56] (03CR) 10Muehlenhoff: [C: 03+2] Reset profile::ganeti::ganeti216 to false for the role [puppet] - 10https://gerrit.wikimedia.org/r/736449 (https://phabricator.wikimedia.org/T286206) (owner: 10Muehlenhoff) [11:52:17] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: Add CI check to ensure defaults exist in cloud.yaml - https://phabricator.wikimedia.org/T248994 (10jbond) 05Open→03Resolved a:03jbond resolving this task, we do have a check however i think we should fix this by using via different means [11:52:22] 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 3 others: Easing pain points caused by divergence between cloudservices and production puppet usecases - https://phabricator.wikimedia.org/T285539 (10jbond) [11:52:49] 10Puppet, 10SRE, 10Infrastructure-Foundations: Upgrade Puppet to 5.5.21 - https://phabricator.wikimedia.org/T248168 (10jbond) 05Open→03Resolved a:03jbond We are currently running the latest 5.5.* branch [11:53:09] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: wmf-styleguide checks: unable to ignore violations inside roles - https://phabricator.wikimedia.org/T280353 (10Aklapper) [11:53:16] (03CR) 10LSobanski: [C: 03+1] admin: Adding myself to ops [puppet] - 10https://gerrit.wikimedia.org/r/736448 (owner: 10Ladsgroup) [11:53:36] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements 2021/2022 - https://phabricator.wikimedia.org/T294906 (10Aklapper) [11:54:36] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-Joe: Update puppet code 
to conform to puppet 4.x and later standards - https://phabricator.wikimedia.org/T181967 (10jbond) [11:54:43] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements 2021/2022 - https://phabricator.wikimedia.org/T294906 (10jbond) [11:54:50] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-Joe: Update puppet code to conform to puppet 4.x and later standards - https://phabricator.wikimedia.org/T181967 (10jbond) [11:54:59] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements 2021/2022 - https://phabricator.wikimedia.org/T294906 (10jbond) [11:55:03] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Clean up SSL configuration - https://phabricator.wikimedia.org/T240941 (10jbond) [11:55:35] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements 2021/2022 - https://phabricator.wikimedia.org/T294906 (10jbond) [11:55:58] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements 2021/2022 - https://phabricator.wikimedia.org/T294906 (10jbond) [11:56:04] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: puppetmasters: update the puppet masters so they use them self for the puppet run - https://phabricator.wikimedia.org/T238093 (10jbond) [11:56:28] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 (10jbond) [11:56:34] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements 2021/2022 - https://phabricator.wikimedia.org/T294906 (10jbond) [11:56:35] !log deploying wikidiff2-1.13.0-1 to A:mw-codfw and A:mw-api-codfw [11:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:03] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10puppet-compiler, and 2 others: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10jbond) [11:57:05] 10Puppet, 10SRE, 10Infrastructure-Foundations, 
10puppet-compiler, 10User-jbond: puppet master command will be removed in puppet 6 - https://phabricator.wikimedia.org/T236373 (10jbond) [11:57:16] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Investigate using the rich_data option to support Binary and binary_file for binary data - https://phabricator.wikimedia.org/T236481 (10jbond) 05Open→03Declined This is not usefull untill puppet6 where we get it by default [11:57:45] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements 2021/2022 - https://phabricator.wikimedia.org/T294906 (10jbond) [11:57:52] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: missing CRL - https://phabricator.wikimedia.org/T235185 (10jbond) [11:58:28] !log rolling restart-php7.2-fpm on A:mw-codfw and A:mw-api-codfw [11:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:02] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements 2021/2022 - https://phabricator.wikimedia.org/T294906 (10jbond) [11:59:12] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10jbond) [11:59:20] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements 2021/2022 - https://phabricator.wikimedia.org/T294906 (10jbond) [11:59:26] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10observability, and 2 others: Puppet: get data (row, rack, site, and other information) from Netbox - https://phabricator.wikimedia.org/T229397 (10jbond) [12:00:28] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Packaging, and 2 others: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10jbond) [12:00:35] 10SRE, 10Traffic, 10Wikimedia Enterprise, 10Wikimedia Enterprise Discussion: Allow-Listing for Enterprise IPs - https://phabricator.wikimedia.org/T294798 (10AnnaMikla) [12:00:39] 10Puppet, 
10SRE, 10Infrastructure-Foundations, 10User-jbond: facter3: use structured facts - https://phabricator.wikimedia.org/T222160 (10jbond) 05Open→03Declined legacy facts are going o be around for a long time as they are backed into puppet itself. As such there is no need to do this [12:00:43] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements 2021/2022 - https://phabricator.wikimedia.org/T294906 (10jbond) [12:00:55] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppet fact: migrate away from the uniqueid fact - https://phabricator.wikimedia.org/T221083 (10jbond) [12:01:06] 10Puppet, 10SRE, 10SRE-tools, 10Infrastructure-Foundations: wmf_style check in puppet silently fails when it finds the addition of an error that was also already occurring in the same file - https://phabricator.wikimedia.org/T219085 (10jbond) [12:01:12] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements 2021/2022 - https://phabricator.wikimedia.org/T294906 (10jbond) [12:02:12] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements 2021/2022 - https://phabricator.wikimedia.org/T294906 (10jbond) [12:02:19] 10Puppet, 10SRE, 10Infrastructure-Foundations: puppet (systemd::service) attempts to start manually masked units - https://phabricator.wikimedia.org/T211027 (10jbond) [12:04:19] 10SRE, 10SRE-Access-Requests: Requesting access to Production Shell Groups & JupyterHub for echetty - https://phabricator.wikimedia.org/T294229 (10EChetty) Working as expected! 
Thank you [12:04:37] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10puppet-compiler, and 2 others: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10jbond) [12:05:08] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: require_package should mark packages as manually installed - https://phabricator.wikimedia.org/T195981 (10jbond) 05Open→03Resolved a:03jbond reqiure_package has been removed [12:05:16] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Tracking-Neverending: Puppet: tracking catalogs that changes at every run - https://phabricator.wikimedia.org/T191388 (10jbond) 05Open→03Resolved a:03jbond We now have this as a nrpe check running on the puppetdb servers [12:05:23] (03PS1) 10Muehlenhoff: Add further engineering managers for ops: approval [puppet] - 10https://gerrit.wikimedia.org/r/736451 [12:05:28] 10Puppet, 10Infrastructure-Foundations: Investigate using SRV records for puppet - https://phabricator.wikimedia.org/T190665 (10jbond) p:05Triage→03Medium [12:05:44] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements 2021/2022 - https://phabricator.wikimedia.org/T294906 (10jbond) [12:05:50] 10Puppet, 10Infrastructure-Foundations: Investigate using SRV records for puppet - https://phabricator.wikimedia.org/T190665 (10jbond) [12:06:52] 10Puppet, 10Infrastructure-Foundations: Investigate wrong location for /srv/private post-receive hook in puppetmaster::gitclone - https://phabricator.wikimedia.org/T190157 (10jbond) 05Open→03Resolved a:03jbond this has since been fixed [12:07:08] 10Puppet, 10SRE, 10Infrastructure-Foundations: Decrease the amount of IRC spam in case of widespread puppet failures - https://phabricator.wikimedia.org/T188602 (10jbond) 05Open→03Resolved a:03jbond this has been fixed [12:08:02] 10Puppet, 10SRE, 10Infrastructure-Foundations: /etc/puppet/hiera.yaml: Use of 'hiera.yaml' version 3 is deprecated. 
It should be converted to version 5 - https://phabricator.wikimedia.org/T185814 (10jbond) 05Open→03Resolved a:03jbond we are now using hiera version 5 everywhere [12:08:28] 10Puppet, 10SRE, 10Infrastructure-Foundations: Fix unknown variables warning that occur with puppet 4.x - https://phabricator.wikimedia.org/T184186 (10jbond) [12:08:31] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements 2021/2022 - https://phabricator.wikimedia.org/T294906 (10jbond) [12:08:48] (03PS1) 10Urbanecm: foundationwiki: Increase AF throttle requirements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736452 [12:08:52] jouncebot: nowandnext [12:08:53] No deployments scheduled for the next 5 hour(s) and 51 minute(s) [12:08:53] In 5 hour(s) and 51 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211103T1800) [12:08:53] In 5 hour(s) and 51 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211103T1800) [12:08:58] (03PS1) 10Majavah: admin: Add ssh key format check to CI [puppet] - 10https://gerrit.wikimedia.org/r/736453 (https://phabricator.wikimedia.org/T270073) [12:09:05] (03CR) 10Urbanecm: [C: 03+2] foundationwiki: Increase AF throttle requirements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736452 (owner: 10Urbanecm) [12:09:45] 10Puppet, 10SRE, 10Infrastructure-Foundations: Fix regex.yaml single-regex issue - https://phabricator.wikimedia.org/T183565 (10jbond) [12:09:51] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements 2021/2022 - https://phabricator.wikimedia.org/T294906 (10jbond) [12:09:57] (03Merged) 10jenkins-bot: foundationwiki: Increase AF throttle requirements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736452 (owner: 10Urbanecm) [12:10:19] (03PS1) 10Urbanecm: Revert "Adjust AF config for ukwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736253 (https://phabricator.wikimedia.org/T272330) 
[12:10:28] (03CR) 10jerkins-bot: [V: 04-1] Revert "Adjust AF config for ukwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736253 (https://phabricator.wikimedia.org/T272330) (owner: 10Urbanecm) [12:10:52] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-Joe: Puppet4: hiera() can only be called using the 4.x function API. - https://phabricator.wikimedia.org/T179181 (10jbond) 05Open→03Resolved a:03jbond We have dropped all uses of the hiera call so assuming this can be closed, please reopen if some... [12:10:56] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254 (10jbond) [12:11:15] 10Puppet, 10Infrastructure-Foundations: Set puppet config_version to something referring to git - https://phabricator.wikimedia.org/T171477 (10jbond) 05Open→03Resolved a:03jbond this has since been added [12:11:30] (03CR) 10jerkins-bot: [V: 04-1] admin: Add ssh key format check to CI [puppet] - 10https://gerrit.wikimedia.org/r/736453 (https://phabricator.wikimedia.org/T270073) (owner: 10Majavah) [12:11:44] (03PS2) 10Urbanecm: Revert "Adjust AF config for ukwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736253 (https://phabricator.wikimedia.org/T272330) [12:11:59] (03CR) 10Urbanecm: [C: 03+2] Revert "Adjust AF config for ukwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736253 (https://phabricator.wikimedia.org/T272330) (owner: 10Urbanecm) [12:12:07] 10Puppet, 10SRE, 10Infrastructure-Foundations: Use multiple puppetdbs on puppet masters - https://phabricator.wikimedia.org/T169318 (10jbond) [12:12:13] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements 2021/2022 - https://phabricator.wikimedia.org/T294906 (10jbond) [12:12:15] (03PS2) 10Majavah: admin: Add ssh key format check to CI [puppet] - 10https://gerrit.wikimedia.org/r/736453 (https://phabricator.wikimedia.org/T270073) [12:12:18] (03CR) 10jerkins-bot: 
[V: 04-1] Revert "Adjust AF config for ukwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736253 (https://phabricator.wikimedia.org/T272330) (owner: 10Urbanecm) [12:13:02] (03PS3) 10Urbanecm: Revert "Adjust AF config for ukwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736253 (https://phabricator.wikimedia.org/T272330) [12:13:08] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements 2021/2022 - https://phabricator.wikimedia.org/T294906 (10jbond) [12:13:15] 10Puppet, 10SRE, 10Infrastructure-Foundations: Ensure that there are no firewall rules in modules - https://phabricator.wikimedia.org/T114209 (10jbond) [12:13:16] (03CR) 10Urbanecm: [C: 03+2] Revert "Adjust AF config for ukwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736253 (https://phabricator.wikimedia.org/T272330) (owner: 10Urbanecm) [12:13:40] (03CR) 10Ladsgroup: [C: 03+1] Add further engineering managers for ops: approval [puppet] - 10https://gerrit.wikimedia.org/r/736451 (owner: 10Muehlenhoff) [12:13:41] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 667ef0b6e9e8d1d70061cc904ce49e7632300b75: foundationwiki: Increase AF throttle requirements (duration: 01m 13s) [12:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:53] 10Puppet, 10Infrastructure-Foundations: Nuyaml_backend does not allow binary Hiera data - https://phabricator.wikimedia.org/T113328 (10jbond) the numa yaml back end has since been re-writen, can you confirm if this is still an issues? 
[12:14:14] (03Merged) 10jenkins-bot: Revert "Adjust AF config for ukwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736253 (https://phabricator.wikimedia.org/T272330) (owner: 10Urbanecm) [12:14:42] 10Puppet, 10Infrastructure-Foundations: wmf-style lint detects variable expansion in variables as parameter declaration - https://phabricator.wikimedia.org/T260574 (10jbond) p:05Triage→03Medium [12:14:44] (03CR) 10jerkins-bot: [V: 04-1] admin: Add ssh key format check to CI [puppet] - 10https://gerrit.wikimedia.org/r/736453 (https://phabricator.wikimedia.org/T270073) (owner: 10Majavah) [12:15:06] 10Puppet, 10Infrastructure-Foundations: Allow variables without hiera calls as lookup() default parameters - https://phabricator.wikimedia.org/T234459 (10jbond) [12:15:13] 10Puppet, 10Infrastructure-Foundations: wmf-style lint detects variable expansion in variables as parameter declaration - https://phabricator.wikimedia.org/T260574 (10jbond) [12:15:15] (03CR) 10Majavah: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/736453 (https://phabricator.wikimedia.org/T270073) (owner: 10Majavah) [12:15:29] 10Puppet, 10Infrastructure-Foundations: wmf-style lint detects variable expansion in variables as parameter declaration - https://phabricator.wikimedia.org/T260574 (10jbond) [12:15:36] 10Puppet, 10SRE, 10Infrastructure-Foundations: wmf-style adds 'has no call to hiera' violations for parameters already containing hiera calls - https://phabricator.wikimedia.org/T207285 (10jbond) [12:15:42] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 9ca753bf4b7afea41c29225d4f32e3ba01bf7c30: Revert "Adjust AF config for ukwiki" (T272330) (duration: 01m 03s) [12:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:29] hi all sorry for all the phabricator spam, i have finished organising now [12:18:20] !log upload trafficserver 8.0.8-1wm5 to apt.wm.org (buster) - T294897 [12:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:24] (03CR) 10LSobanski: [C: 03+1] Add further engineering managers for ops: approval [puppet] - 10https://gerrit.wikimedia.org/r/736451 (owner: 10Muehlenhoff) [12:18:46] jbond: do you happen to know why the ci for https://gerrit.wikimedia.org/r/c/operations/puppet/+/736453/ is failing? does not seem related to my patch [12:19:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:12] !log update trafficserver on cp4021 to 8.0.8-1wm5 - T294897 [12:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:35] (03PS1) 10Muehlenhoff: sre.ganeti.makevm: Relax globbing for interface name used in bridges [cookbooks] - 10https://gerrit.wikimedia.org/r/736456 [12:21:48] !log update trafficserver on cp4027 to 8.0.8-1wm5 - T294897 [12:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:50] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me, merging" [puppet] - 10https://gerrit.wikimedia.org/r/736448 (owner: 10Ladsgroup) [12:22:52] (03CR) 10Muehlenhoff: [C: 03+2] admin: Adding myself to ops [puppet] - 10https://gerrit.wikimedia.org/r/736448 (owner: 10Ladsgroup) [12:27:45] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:29:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' 
. [12:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:52] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:29:56] (03CR) 10Jbond: [C: 03+1] "LGTM failure is unrelated to this" [puppet] - 10https://gerrit.wikimedia.org/r/736453 (https://phabricator.wikimedia.org/T270073) (owner: 10Majavah) [12:31:02] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: investigate puppet-lint-security-plugins - https://phabricator.wikimedia.org/T294907 (10jbond) [12:33:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:13] (03CR) 10JMeybohm: [C: 03+1] charts: bump common_templates to 0.4 and chart versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/736227 (https://phabricator.wikimedia.org/T292390) (owner: 10Jelto) [12:39:36] (03PS1) 10ArielGlenn: add credentials file for downloading enterprise html dumps [puppet] - 10https://gerrit.wikimedia.org/r/736461 (https://phabricator.wikimedia.org/T273585) [12:40:05] RECOVERY - Maps tiles generation on alert1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [12:55:05] (03CR) 10JMeybohm: [C: 03+1] "Nice! Make sure to add a doc string to all helmfiles as well when removing the helm2/3 if-guard after migration." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/735979 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [13:05:49] (03CR) 10BBlack: [C: 03+2] conftool-data: remove "dns" cluster [puppet] - 10https://gerrit.wikimedia.org/r/735991 (owner: 10BBlack) [13:06:59] (03PS2) 10BBlack: conftool-data: remove "dns" cluster [puppet] - 10https://gerrit.wikimedia.org/r/735991 [13:07:40] (03CR) 10BBlack: [C: 03+2] conftool-data: remove "dns" cluster [puppet] - 10https://gerrit.wikimedia.org/r/735991 (owner: 10BBlack) [13:17:16] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/736451 (owner: 10Muehlenhoff) [13:19:07] (03CR) 10Muehlenhoff: [C: 03+1] "Nice! Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/736453 (https://phabricator.wikimedia.org/T270073) (owner: 10Majavah) [13:19:32] (03CR) 10Alexandros Kosiaris: [C: 03+1] kserve: add network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/732939 (https://phabricator.wikimedia.org/T289834) (owner: 10Elukey) [13:23:05] (03CR) 10Muehlenhoff: [C: 03+2] Add ownership annotations for WMCS services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/732307 (https://phabricator.wikimedia.org/T216088) (owner: 10Muehlenhoff) [13:24:47] (03CR) 10LMata: [C: 03+1] Add further engineering managers for ops: approval [puppet] - 10https://gerrit.wikimedia.org/r/736451 (owner: 10Muehlenhoff) [13:25:58] (03CR) 10Muehlenhoff: "check" [puppet] - 10https://gerrit.wikimedia.org/r/702669 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [13:26:52] (03CR) 10Jelto: [C: 03+2] services: add support to deploy all services with helm3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/735979 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [13:27:12] (03CR) 10Jelto: [C: 03+2] charts: bump common_templates to 0.4 and chart versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/736227 (https://phabricator.wikimedia.org/T292390) (owner: 
10Jelto) [13:27:15] (03PS1) 10Majavah: smart: Fix quotes in tests [puppet] - 10https://gerrit.wikimedia.org/r/736468 [13:29:14] (03CR) 10Majavah: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/736468/ fixes the test suite" [puppet] - 10https://gerrit.wikimedia.org/r/736453 (https://phabricator.wikimedia.org/T270073) (owner: 10Majavah) [13:29:42] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: require_package should mark packages as manually installed - https://phabricator.wikimedia.org/T195981 (10MoritzMuehlenhoff) >>! In T195981#7477405, @jbond wrote: > reqiure_package has been removed But ensure_packages has the same issue, hasn't it? [13:31:12] (03Merged) 10jenkins-bot: services: add support to deploy all services with helm3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/735979 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [13:31:35] (03Merged) 10jenkins-bot: charts: bump common_templates to 0.4 and chart versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/736227 (https://phabricator.wikimedia.org/T292390) (owner: 10Jelto) [13:34:01] !log cp403[3456] - depool ats-be service (upcoming re-reimage) [13:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:22] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp403[3456].*,service=ats-be [13:34:23] (03CR) 10Arturo Borrero Gonzalez: add credentials file for downloading enterprise html dumps (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/736461 (https://phabricator.wikimedia.org/T273585) (owner: 10ArielGlenn) [13:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:24] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/736456 (owner: 10Muehlenhoff) [13:38:24] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . 
[13:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:41] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Data-Engineering: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10Ottomata) Thank you! Just in time! :) [13:40:50] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [13:53:58] (03PS1) 10David Caro: p:ceph::auth: add nova-compute dummy key [labs/private] - 10https://gerrit.wikimedia.org/r/736470 [14:02:03] (03CR) 10Ottomata: Add checks for druid datasources to alertmanager (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/736279 (https://phabricator.wikimedia.org/T293399) (owner: 10Btullis) [14:02:18] (03PS1) 10Ssingh: test_dns: strip trailing = from urlsafe_b64encode [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/736472 [14:04:21] !log update eqsin and ulsfo cp instances to ATS 8.0.8-1wm5 - T294897 [14:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:55] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wcqs_443: Servers wcqs1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:05:38] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wcqs_443: Servers wcqs1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:07:33] !log move cr2-codfw access switches link to working linecard - T289241 [14:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:58] RECOVERY - OSPF status on mr1-codfw is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:10:37] !log initialising ganeti-test01.svc.codfw.wmnet cluster on ganeti-test2001 T286206 [14:10:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:40] T286206: Create Ganeti test cluster - https://phabricator.wikimedia.org/T286206 [14:12:51] (03PS5) 10Kormat: mariadb: Set important db host monitoring to critical. [puppet] - 10https://gerrit.wikimedia.org/r/736415 (https://phabricator.wikimedia.org/T233684) [14:13:08] (03PS2) 10ArielGlenn: add credentials file for downloading enterprise html dumps [puppet] - 10https://gerrit.wikimedia.org/r/736461 (https://phabricator.wikimedia.org/T273585) [14:13:17] !log installing remaining tiff security updates for buster [14:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:44] RECOVERY - Juniper alarms on cr2-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [14:14:07] (03CR) 10ArielGlenn: add credentials file for downloading enterprise html dumps (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/736461 (https://phabricator.wikimedia.org/T273585) (owner: 10ArielGlenn) [14:15:56] PROBLEM - ats-tls HTTPS wikiworkshop.org ECDSA on cp5009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [14:16:35] !log deploying wikidiff2-1.13.0-1 to A:mw-eqiad and A:mw-api-eqiad [14:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:51] (03PS1) 10MMandere: install_server: Update instance hardware category [puppet] - 10https://gerrit.wikimedia.org/r/736475 (https://phabricator.wikimedia.org/T290694) [14:17:11] ^^ cp5009 is being updated [14:18:00] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'staging' . 
[14:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) Hey Guys, The cabling plan for the switch->switch cabling in the new Eqiad cage should be as follows: ` LSW1-E1 Links: LSW1-E... [14:18:20] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 127, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:19:02] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 14 NOOP 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32101/console" [puppet] - 10https://gerrit.wikimedia.org/r/736415 (https://phabricator.wikimedia.org/T233684) (owner: 10Kormat) [14:20:08] RECOVERY - ats-tls HTTPS wikiworkshop.org ECDSA on cp5009 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 387592 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2021-12-05 17:00:17 +0000 (expires in 32 days) https://wikitech.wikimedia.org/wiki/HTTPS [14:20:17] !log rolling restart-php7.2-fpm on A:mw-eqiad and A:mw-api-eqiad [14:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:03] (03CR) 10David Caro: [V: 03+2 C: 03+2] p:ceph::auth: add nova-compute dummy key [labs/private] - 10https://gerrit.wikimedia.org/r/736470 (owner: 10David Caro) [14:21:40] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [14:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:34] jelto: o/ I'd need to deploy a new version of the api-gateway chart in codfw/eqiad, can I proceed or do you prefer to do it? 
[14:23:40] changes already applied in staging [14:27:00] elukey: hey o/ as I mentioned in the ops mail I bumped the common_templates. So some minor envoy config changes can be expected. Feel free to deploy it yourself as usual. [14:28:36] ack! [14:30:04] (03PS1) 10Vgutierrez: site: Use role cache::upload_haproxy for cp4026 [puppet] - 10https://gerrit.wikimedia.org/r/736477 (https://phabricator.wikimedia.org/T290005) [14:30:19] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [14:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:11] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [14:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:48] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' . [14:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:41] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' . [14:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:28] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'staging' . [14:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:36] (03CR) 10Ssingh: [C: 03+2] test_dns: strip trailing = from urlsafe_b64encode [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/736472 (owner: 10Ssingh) [14:35:57] (03CR) 10BBlack: [C: 03+1] "LGTM! Thanks, and sorry for not catching this before yesterday's reimages!" 
[puppet] - 10https://gerrit.wikimedia.org/r/736475 (https://phabricator.wikimedia.org/T290694) (owner: 10MMandere) [14:36:47] (Juniper alarm active) firing: Juniper alarm active - https://alerts.wikimedia.org [14:37:30] !log elukey@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [14:37:30] !log elukey@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [14:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:40] !log installing elfutils security updates on stretch [14:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:45] (03CR) 10MMandere: [C: 03+2] install_server: Update instance hardware category [puppet] - 10https://gerrit.wikimedia.org/r/736475 (https://phabricator.wikimedia.org/T290694) (owner: 10MMandere) [14:48:52] (03PS1) 10David Caro: ceph::auth: Add nova-compute client key [puppet] - 10https://gerrit.wikimedia.org/r/736483 (https://phabricator.wikimedia.org/T293752) [14:49:05] (03CR) 10Muehlenhoff: [C: 03+2] sre.ganeti.makevm: Relax globbing for interface name used in bridges [cookbooks] - 10https://gerrit.wikimedia.org/r/736456 (owner: 10Muehlenhoff) [14:54:40] !log elukey@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . 
[14:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:23] (03PS1) 10David Caro: ceph::auth: rename compute auth key [labs/private] - 10https://gerrit.wikimedia.org/r/736489 [15:00:20] (03PS1) 10Btullis: Add the first eventgate alert to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/736490 (https://phabricator.wikimedia.org/T293399) [15:01:53] (03CR) 10David Caro: [V: 03+2 C: 03+2] ceph::auth: rename compute auth key [labs/private] - 10https://gerrit.wikimedia.org/r/736489 (owner: 10David Caro) [15:02:34] (03PS2) 10Btullis: Add checks for druid datasources to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/736279 (https://phabricator.wikimedia.org/T293399) [15:05:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2001.codfw.wmnet [15:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:18] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [15:06:18] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [15:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:29] (03PS7) 10Elukey: kserve: add network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/732939 (https://phabricator.wikimedia.org/T289834) [15:08:46] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [15:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:33] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . 
[15:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:59] (03PS1) 10Jgiannelos: tegola-vector-tiles: Use batched tile changes kafka stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/736494 (https://phabricator.wikimedia.org/T293366) [15:10:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2001.codfw.wmnet [15:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:31] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [15:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:31] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [15:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:57] 10SRE, 10ops-ulsfo: ps1-22-ulsfo Cord, Master_Cord_A, Active Power alerting - https://phabricator.wikimedia.org/T294891 (10wiki_willy) a:03RobH [15:12:05] (03CR) 10Jgiannelos: "We changed the event schema for map tile changes in order to be able to batch multiple tiles on a single event. 
This change introduces the" [deployment-charts] - 10https://gerrit.wikimedia.org/r/736494 (https://phabricator.wikimedia.org/T293366) (owner: 10Jgiannelos) [15:13:23] (03CR) 10Jgiannelos: "Here is a link to the new schema:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/736494 (https://phabricator.wikimedia.org/T293366) (owner: 10Jgiannelos) [15:17:32] (03PS2) 10Ladsgroup: smart: Fix quotes in tests [puppet] - 10https://gerrit.wikimedia.org/r/736468 (owner: 10Majavah) [15:17:47] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] smart: Fix quotes in tests [puppet] - 10https://gerrit.wikimedia.org/r/736468 (owner: 10Majavah) [15:17:53] (03CR) 10Elukey: [C: 03+2] kserve: add network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/732939 (https://phabricator.wikimedia.org/T289834) (owner: 10Elukey) [15:18:00] (03CR) 10David Caro: [V: 03+1] "Even if the pcc does not show it as a difference, this introduces a second cehp::auth::keyring with the new data on cloudcephmon2001-dev" [puppet] - 10https://gerrit.wikimedia.org/r/736483 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [15:18:59] (03PS3) 10Majavah: admin: Add ssh key format check to CI [puppet] - 10https://gerrit.wikimedia.org/r/736453 (https://phabricator.wikimedia.org/T270073) [15:21:03] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [15:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:05] !log ppchelko@deploy1002 Started deploy [restbase/deploy@664a2f8]: Add new wikis T292422 T294587 T294588 [15:21:07] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[15:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:11] T292422: Add amiwiki to RESTBase - https://phabricator.wikimedia.org/T292422 [15:21:11] T294587: Add pwnwiki to RESTBase - https://phabricator.wikimedia.org/T294587 [15:21:11] T294588: Add lmowiktionary to RESTBase - https://phabricator.wikimedia.org/T294588 [15:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:53] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] admin: Add ssh key format check to CI [puppet] - 10https://gerrit.wikimedia.org/r/736453 (https://phabricator.wikimedia.org/T270073) (owner: 10Majavah) [15:22:11] !log ppchelko@deploy1002 Started deploy [restbase/deploy@664a2f8]: Add new wikis T292422 T294587 T294588 [15:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:32] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Add check for ssh key type in admin module CI - https://phabricator.wikimedia.org/T270073 (10Majavah) 05Open→03Resolved [15:22:38] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements 2021/2022 - https://phabricator.wikimedia.org/T294906 (10Majavah) [15:22:47] !log ppchelko@deploy1002 Finished deploy [restbase/deploy@664a2f8]: Add new wikis T292422 T294587 T294588 (duration: 00m 36s) [15:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:50] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Add check for ssh key type in admin module CI - https://phabricator.wikimedia.org/T270073 (10Majavah) a:05jbond→03Majavah [15:25:56] (03PS1) 10Ladsgroup: [DNM] Test breaking ssh key [puppet] - 10https://gerrit.wikimedia.org/r/736498 (https://phabricator.wikimedia.org/T270073) [15:26:48] (03CR) 10jerkins-bot: [V: 04-1] [DNM] Test breaking ssh key [puppet] - 10https://gerrit.wikimedia.org/r/736498 (https://phabricator.wikimedia.org/T270073) (owner: 10Ladsgroup) [15:28:54] (03CR) 
10Majavah: lists: Split ferm and monitoring of profile::lists (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/731286 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [15:30:31] (03CR) 10Legoktm: [C: 03+1] lists: Split ferm and monitoring of profile::lists [puppet] - 10https://gerrit.wikimedia.org/r/731286 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [15:30:33] (03CR) 10Btullis: Add checks for druid datasources to alertmanager (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/736279 (https://phabricator.wikimedia.org/T293399) (owner: 10Btullis) [15:30:37] (03Abandoned) 10Ladsgroup: [DNM] Test breaking ssh key [puppet] - 10https://gerrit.wikimedia.org/r/736498 (https://phabricator.wikimedia.org/T270073) (owner: 10Ladsgroup) [15:31:38] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Add check for ssh key type in admin module CI - https://phabricator.wikimedia.org/T270073 (10Ladsgroup) Legend: ` 15:26:32 =================================== FAILURES =================================== 15:26:32 ____________________... 
[15:32:00] (03CR) 10Ladsgroup: lists: Split ferm and monitoring of profile::lists (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/731286 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [15:35:10] (03PS1) 10Majavah: P::configmaster: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/736499 [15:36:00] (03CR) 10Majavah: toolforge::cronrunner: disable cron on non-active hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/732986 (https://phabricator.wikimedia.org/T284767) (owner: 10Majavah) [15:37:16] (03CR) 10Herron: [C: 03+2] base_packages: install netcat-openbsd by default [puppet] - 10https://gerrit.wikimedia.org/r/735413 (owner: 10Herron) [15:38:49] (WdqsStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [15:43:49] (WdqsStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [15:43:53] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [15:45:03] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is CRITICAL: 1.016e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [15:45:32] (03CR) 10Ottomata: Add checks for druid datasources to alertmanager (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/736279 (https://phabricator.wikimedia.org/T293399) (owner: 10Btullis) [15:47:27] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' . 
[15:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:06] so main-codfw's mirror maker seems to struggle a bit (intermittently) to mirror the eqiad ChangeDeletionNotification topic [15:48:09] https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&refresh=5m&from=now-24h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&var-topic=eqiad.mediawiki.job.ChangeDeletionNotification [15:48:27] (03PS5) 10Ladsgroup: lists: Split ferm and monitoring of profile::lists [puppet] - 10https://gerrit.wikimedia.org/r/731286 (https://phabricator.wikimedia.org/T282303) [15:48:28] and indeed there are spikes of traffic up to 2k msgs/s [15:49:53] the topic has only one partition, maybe we could expand it to 3 [15:50:05] even if the rate of traffic is not constant [15:50:50] (03CR) 10Ottomata: Add checks for druid datasources to alertmanager (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/736279 (https://phabricator.wikimedia.org/T293399) (owner: 10Btullis) [15:51:09] !log rolling restart-php7.2-fpm on A:mw-api-codfw to pick up wikidiff2 upgrade [15:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:09] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] lists: Split ferm and monitoring of profile::lists [puppet] - 10https://gerrit.wikimedia.org/r/731286 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [15:53:39] 10SRE, 10Community-Tech, 10serviceops, 10wikidiff2, 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (10hnowlan) The deployment is just finishing up to codfw's API servers in the next few minutes, all others are complete. P... 
[15:53:43] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32105/console" [puppet] - 10https://gerrit.wikimedia.org/r/736483 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [15:53:46] 10SRE, 10Community-Tech, 10serviceops, 10wikidiff2, 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (10hnowlan) [15:53:51] the alert should resolve soon, if it rehappens we can check what to do [15:54:14] 10SRE, 10Community-Tech, 10serviceops, 10wikidiff2, 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (10hnowlan) PHP 7.4 images have been bumped via https://gerrit.wikimedia.org/r/plugins/gitiles/operations/docker-images/pr... [15:55:02] (03PS1) 10Razzi: presto: enable ui [puppet] - 10https://gerrit.wikimedia.org/r/736503 (https://phabricator.wikimedia.org/T292087) [15:55:09] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01076 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:57:28] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [15:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:24] sounds good elukey and +1 for increasing partitions [15:58:38] (03CR) 10Razzi: "I put the config straight in the puppet code; if it would be better to make a yaml setting for this I can do that alternatively." 
[puppet] - 10https://gerrit.wikimedia.org/r/736503 (https://phabricator.wikimedia.org/T292087) (owner: 10Razzi) [15:58:40] !log depool cp4033.ulsfo.wmnet - T290694 [15:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:43] T290694: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 [15:58:52] 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (10RobH) [15:58:55] puppet failures alert is related to my netcat change, will follow up with a patch to fix that [15:59:04] 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (10RobH) [15:59:31] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'staging' . [15:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:41] 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (10RobH) a:03Papaul [16:00:28] herron: sending patch now [16:00:37] (03PS1) 10Herron: netconsole::server: remove duplicate netcat-openbsd package [puppet] - 10https://gerrit.wikimedia.org/r/736504 [16:00:39] (03CR) 10MSantos: [C: 03+2] tegola-vector-tiles: Use batched tile changes kafka stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/736494 (https://phabricator.wikimedia.org/T293366) (owner: 10Jgiannelos) [16:00:54] 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (10RobH) [16:00:58] jbond: just uploaded https://gerrit.wikimedia.org/r/736504 too [16:00:59] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/736504 (owner: 10Herron) [16:01:07] yes +1 you beat me too it :) [16:01:16] 
10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (10RobH) [16:01:27] (03CR) 10Herron: [C: 03+2] netconsole::server: remove duplicate netcat-openbsd package [puppet] - 10https://gerrit.wikimedia.org/r/736504 (owner: 10Herron) [16:01:32] 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (10RobH) a:03Papaul [16:01:37] jbond: thx! [16:02:44] (03PS1) 10Jbond: (WIP) to test CI [puppet] - 10https://gerrit.wikimedia.org/r/736505 [16:03:20] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32107/console" [puppet] - 10https://gerrit.wikimedia.org/r/736505 (owner: 10Jbond) [16:04:52] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp4033.ulsfo.wmnet with OS buster [16:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:57] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10Cmjohnson) I am still not sure where this server is, I cannot find it. @Jclark-ctr is out this week. 
[16:04:59] (03Merged) 10jenkins-bot: tegola-vector-tiles: Use batched tile changes kafka stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/736494 (https://phabricator.wikimedia.org/T293366) (owner: 10Jgiannelos) [16:05:54] (03PS1) 10ArielGlenn: add fake enterprise api dumps downloader credentials [labs/private] - 10https://gerrit.wikimedia.org/r/736527 (https://phabricator.wikimedia.org/T273585) [16:07:16] 10SRE, 10ops-eqiad, 10Sustainability (Incident Followup): eqiad: patch 2nd Equinix IXP - https://phabricator.wikimedia.org/T293726 (10Cmjohnson) 05Open→03Resolved If there is a need to keep this open in DC-OPS please re-open [16:08:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:(Need By: TBD) rack/setup (4) fundraising hosts - https://phabricator.wikimedia.org/T289812 (10Cmjohnson) 05Open→03Resolved DC-OPS portion is complete, if there are any issues please re-open [16:15:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:28] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . [16:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:57] 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10RobH) [16:21:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[16:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:19] 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10RobH) [16:21:33] 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10RobH) a:03Jclark-ctr [16:23:22] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [16:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:37] (03PS2) 10Btullis: Add the first eventgate alert to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/736490 (https://phabricator.wikimedia.org/T293399) [16:27:09] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . [16:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:11] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005666 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [16:31:52] !log installing wikidiff2-1.13.0-1 to A:mw-jobrunner [16:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:39] (03PS1) 10Muehlenhoff: Extend ganeti-all alias to also include ganeti_test [puppet] - 10https://gerrit.wikimedia.org/r/736529 (https://phabricator.wikimedia.org/T286206) [16:34:45] PROBLEM - Host scs-a8-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:35:03] (03CR) 10Btullis: presto: enable ui (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/736503 (https://phabricator.wikimedia.org/T292087) (owner: 10Razzi) [16:36:19] (03PS1) 10Ladsgroup: Fix name of the script [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/736531 [16:36:28] (03PS1) 10Jelto: 
charts:chromium-reader fix wrong labels and selectors in deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/736532 (https://phabricator.wikimedia.org/T292390) [16:37:34] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/736531 (owner: 10Ladsgroup) [16:38:23] (03CR) 10Muehlenhoff: [C: 03+2] Extend ganeti-all alias to also include ganeti_test [puppet] - 10https://gerrit.wikimedia.org/r/736529 (https://phabricator.wikimedia.org/T286206) (owner: 10Muehlenhoff) [16:38:25] 10SRE, 10ops-eqiad: Q1 '19:(Need by: 2020-06-30) replace scs-a8-eqiad - https://phabricator.wikimedia.org/T228919 (10Cmjohnson) the new SCS is racked in A8 u47 [16:38:53] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [16:39:51] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:40:13] RECOVERY - Host scs-a8-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.46 ms [16:43:17] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): Investigate use of Puppet "environments" for per-project Puppet manifests - https://phabricator.wikimedia.org/T170370 (10jbond) I had/have a use case where i wanted to use some third party modules as such i have added r10k... [16:45:43] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Fix name of the script [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/736531 (owner: 10Ladsgroup) [16:46:56] (03CR) 10JMeybohm: [C: 03+1] charts:chromium-reader fix wrong labels and selectors in deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/736532 (https://phabricator.wikimedia.org/T292390) (owner: 10Jelto) [16:47:02] 10SRE-swift-storage: Storage request for datasets published by research team - https://phabricator.wikimedia.org/T294380 (10MatthewVernon) Hi, Sorry for the delay in getting back to you. 
I have a couple of questions about your request, if I may: 1. Are you OK with using the `S3` protocol (rather than the Swift... [16:49:34] (03CR) 10Jelto: [C: 03+2] charts:chromium-reader fix wrong labels and selectors in deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/736532 (https://phabricator.wikimedia.org/T292390) (owner: 10Jelto) [16:50:17] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is CRITICAL: 1.136e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [16:52:34] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4033.ulsfo.wmnet with OS buster [16:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:41] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4033.ulsfo.wmnet with OS buster completed: - cp4033 (**WARN**... 
[16:53:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2003.codfw.wmnet [16:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:24] (03Merged) 10jenkins-bot: charts:chromium-reader fix wrong labels and selectors in deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/736532 (https://phabricator.wikimedia.org/T292390) (owner: 10Jelto) [16:58:51] !log razzi@deploy1002 Started deploy [analytics/superset/deploy@5b8de4c]: Upgrade superset to 1.3.1 [16:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:53] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [16:59:21] !log razzi@deploy1002 Finished deploy [analytics/superset/deploy@5b8de4c]: Upgrade superset to 1.3.1 (duration: 00m 31s) [16:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:20] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . [17:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:11] !log jgiannelos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . 
[17:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:21] (03PS1) 10Ladsgroup: icinga: Add myself to authorized people [puppet] - 10https://gerrit.wikimedia.org/r/736537 [17:02:45] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 6630 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [17:04:53] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/736537 (owner: 10Ladsgroup) [17:05:15] !log jgiannelos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [17:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:15] !log pool cp4033.ulsfo.wmnet - T290694 [17:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:17] T290694: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 [17:06:47] (03PS3) 10David Caro: ceph::auth: Add codfw1dev-compute client key [puppet] - 10https://gerrit.wikimedia.org/r/736483 (https://phabricator.wikimedia.org/T293752) [17:07:58] 10SRE, 10Infrastructure-Foundations, 10netops: Management routers: use BGP instead of OSPF - https://phabricator.wikimedia.org/T294845 (10cmooney) Looks good. As discussed on irc I think the second term in "BGP_production" on mr1 isn't needed, although given how it works can hardly blame you for putting it... 
[17:08:47] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 6 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32108/console" [puppet] - 10https://gerrit.wikimedia.org/r/736483 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [17:10:16] (03PS1) 10David Caro: ceph::auth: add private repo note in fake data [labs/private] - 10https://gerrit.wikimedia.org/r/736538 [17:10:28] PROBLEM - Check systemd state on an-tool1010 is CRITICAL: CRITICAL - degraded: The following units failed: superset.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:29] (03CR) 10David Caro: [V: 03+2 C: 03+2] ceph::auth: add private repo note in fake data [labs/private] - 10https://gerrit.wikimedia.org/r/736538 (owner: 10David Caro) [17:12:49] (03PS1) 10Jelto: charts:chromium-reader bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/736540 (https://phabricator.wikimedia.org/T292390) [17:13:12] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 6 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32109/console" [puppet] - 10https://gerrit.wikimedia.org/r/736483 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [17:15:10] PROBLEM - SSH on wcqs2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:15:20] PROBLEM - SSH on wcqs1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:21:33] (03CR) 10Jelto: [C: 03+2] charts:chromium-reader bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/736540 (https://phabricator.wikimedia.org/T292390) (owner: 10Jelto) [17:22:21] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] icinga: Add myself to authorized people [puppet] - 10https://gerrit.wikimedia.org/r/736537 (owner: 10Ladsgroup) [17:24:45] !log upgrading PHP 7.2 on A:all-mw-codfw [17:24:46] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:44] (03Merged) 10jenkins-bot: charts:chromium-reader bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/736540 (https://phabricator.wikimedia.org/T292390) (owner: 10Jelto) [17:29:08] (03PS1) 10Milimetric: role::common::aqs: update druid mediawiki's datasource [puppet] - 10https://gerrit.wikimedia.org/r/736542 [17:30:50] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . [17:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:37] !log adding BGP peering session to "P Foundation" / AS399728 on cr2-eqiad [Equinix Ashburn IXP] [17:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:17] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [17:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:23] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [17:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:13] (03CR) 10Ahmon Dancy: [C: 04-1] "Agreement on the MW_DEBUG_LOCAL issue. I'll make adjustments." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/732737 (owner: 10Ahmon Dancy) [17:37:35] PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: user@22656.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:40:23] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:41:31] the wcqs host issues are T294865 fyi, I've poked to see if they can get acked/downtimed [17:41:31] T294865: wcqs1002 and wcqs2001 unresponsive - https://phabricator.wikimedia.org/T294865 [17:42:24] 10SRE, 10ops-eqiad: eqiad: add VC-links IDs to Netbox - https://phabricator.wikimedia.org/T268750 (10Cmjohnson) row B eqiad updated [17:42:28] ACKNOWLEDGEMENT - LVS wcqs codfw port 443/tcp - Wikimedia Commons Query Service IPv4 on wcqs.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T294865 https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:42:28] ACKNOWLEDGEMENT - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: user@22656.service daniel_zahn https://phabricator.wikimedia.org/T294865 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:42:28] ACKNOWLEDGEMENT - puppet last run on wcqs1001 is CRITICAL: CRITICAL: Puppet last ran 13 hours ago daniel_zahn https://phabricator.wikimedia.org/T294865 https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:42:28] ACKNOWLEDGEMENT - SSH on wcqs1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T294865 https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:42:28] ACKNOWLEDGEMENT - SSH on wcqs2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T294865 
https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:42:29] ACKNOWLEDGEMENT - Check systemd state on wcqs2002 is CRITICAL: CRITICAL - degraded: The following units failed: session-111576.scope,user@112.service daniel_zahn https://phabricator.wikimedia.org/T294865 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:42:29] ACKNOWLEDGEMENT - puppet last run on wcqs2002 is CRITICAL: CRITICAL: Puppet last ran 17 hours ago daniel_zahn https://phabricator.wikimedia.org/T294865 https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:42:29] ACKNOWLEDGEMENT - SSH on wcqs2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T294865 https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:43:13] 10SRE, 10ops-ulsfo: ps1-22-ulsfo Cord, Master_Cord_A, Active Power alerting - https://phabricator.wikimedia.org/T294891 (10RobH) a:05RobH→03ayounsi TLDR: I think we should raise the threshhold. Details: I pulled up the HTTPS MGMT interface on all 4 new CP hosts (cp403[3-6]) and they are all set to evenl... 
[17:46:41] !log upgrading PHP 7.2 on A:all-mw-eqiad [17:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:33] !log adding BGP peering session to "Liquid Telecommunications" AS30844 on cr2-esams (AMS-IX) [17:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:47] PROBLEM - MariaDB Replica IO: s8 on db2081 is CRITICAL: NRPE: Command check_mariadb_replica_io_state_s8 not defined https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:48:18] !log depool cp4035.ulsfo.wmnet - T290694 [17:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:21] T290694: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 [17:48:33] PROBLEM - MariaDB memory on db2081 is CRITICAL: NRPE: Command check_mariadb_memory not defined https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [17:48:43] PROBLEM - mysqld processes on db2081 is CRITICAL: NRPE: Command check_mysqld not defined https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [17:49:26] !log update codfw cp instances to ATS 8.0.8-1wm5 - T294897 [17:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:51] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp4035.ulsfo.wmnet with OS buster [17:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:57] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4035.ulsfo.wmnet with OS buster [17:51:07] RECOVERY - Check systemd state on an-tool1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:54:34] (03PS1) 10BBlack: Add mmandere to 
icinga contacts/auth [puppet] - 10https://gerrit.wikimedia.org/r/736545 (https://phabricator.wikimedia.org/T281344) [17:54:38] (03CR) 10Joal: [C: 04-1] "We need to apply the patch to both aqs.yaml and aqs_next.yaml (I did it wrong 3 times, now I remember: )" [puppet] - 10https://gerrit.wikimedia.org/r/736542 (owner: 10Milimetric) [17:57:12] (03PS2) 10Milimetric: role::common::aqs: update mw history in both places [puppet] - 10https://gerrit.wikimedia.org/r/736542 [17:57:58] (03CR) 10BBlack: [C: 03+2] Add mmandere to icinga contacts/auth [puppet] - 10https://gerrit.wikimedia.org/r/736545 (https://phabricator.wikimedia.org/T281344) (owner: 10BBlack) [18:00:05] dduvall and twentyafterfour: #bothumor My software never has bugs. It just develops random features. Rise for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211103T1800). [18:00:05] RoanKattouw and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC evening backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211103T1800). [18:00:05] Pchelolo, dbrant, and tgr: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:15] I'm around [18:00:51] * urbanecm waves, but feels like people will self-service? [18:00:54] happy to deploy if needed :) [18:01:08] > If you break AND fix the wikis, you will be rewarded with a sticker [18:01:10] what kind.. [18:01:19] a sticky one [18:01:54] (I'm sold.) [18:02:13] mine are complete no-op [18:02:25] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'sessionstore' for release 'staging' . 
[18:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:44] Pchelolo: go ahead i guess :) [18:02:49] ok [18:03:11] (03PS2) 10Ppchelko: Remove hook set for incident reponse in 2020 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736032 [18:03:20] (03CR) 10Ppchelko: [C: 03+2] Remove hook set for incident reponse in 2020 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736032 (owner: 10Ppchelko) [18:03:22] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox' for release 'main' . [18:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:05] (03Merged) 10jenkins-bot: Remove hook set for incident reponse in 2020 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736032 (owner: 10Ppchelko) [18:04:49] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox-constraints' for release 'main' . [18:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:15] jelto: oh? what's changing? [18:05:42] legoktm: It's only staging-codfw for T251305. Do I cause any issues? [18:05:42] T251305: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 [18:05:51] (03PS3) 10Ppchelko: Clean up temporary variable wgMathUseRestBase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710126 (https://phabricator.wikimedia.org/T274436) [18:05:52] RECOVERY - MariaDB memory on db2081 is OK: OK Memory 80% used https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [18:06:00] ah, nope, just curious! 
[18:06:05] RECOVERY - mysqld processes on db2081 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [18:06:13] !log ppchelko@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:736032|Remove hook set for incident reponse in 2020]] (duration: 01m 03s) [18:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:37] I'm testing the migration path for for all services in staging-codfw. I will add details later to the task if you are interested [18:06:43] (03CR) 10Ppchelko: [C: 03+2] Clean up temporary variable wgMathUseRestBase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710126 (https://phabricator.wikimedia.org/T274436) (owner: 10Ppchelko) [18:07:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:18] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox-media' for release 'main' . [18:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:11] (03Merged) 10jenkins-bot: Clean up temporary variable wgMathUseRestBase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710126 (https://phabricator.wikimedia.org/T274436) (owner: 10Ppchelko) [18:08:33] !log ran set session sql_log_bin=0; RENAME TABLE wb_changes_dispatch TO T294121_DROP_wb_changes_dispatch; on db1111 (T294121) [18:08:34] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox-syntaxhighlight' for release 'main' . 
[18:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:35] T294121: Drop wb_changes_dispatch table in production - https://phabricator.wikimedia.org/T294121 [18:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:54] i don't have +2, so unable to self-service :( [18:09:18] (03CR) 10Ahmon Dancy: First rev of WMF docker-gc image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/732722 (https://phabricator.wikimedia.org/T294034) (owner: 10Ahmon Dancy) [18:09:52] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox-timeline' for release 'main' . [18:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:58] !log ppchelko@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:710126|Clean up temporary variable wgMathUseRestBase (T274436)]] (duration: 01m 03s) [18:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:01] T274436: Enable RESTbaseless validation in wikibase - https://phabricator.wikimedia.org/T274436 [18:10:05] dbrant: which patch? [18:10:11] (03CR) 10Legoktm: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/736211 (owner: 10Giuseppe Lavagetto) [18:10:26] urbanecm: done [18:10:28] thank you [18:10:32] np [18:10:43] https://gerrit.wikimedia.org/r/c/736042 [18:10:43] dbrant: I can deploy for you in that case [18:10:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:06] I let Martin handle it [18:11:37] (03PS2) 10Urbanecm: Add Android site association file. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736042 (https://phabricator.wikimedia.org/T294776) (owner: 10Dbrant) [18:11:58] (03CR) 10Urbanecm: [C: 03+2] Add Android site association file. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/736042 (https://phabricator.wikimedia.org/T294776) (owner: 10Dbrant) [18:12:23] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [18:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:41] (03Merged) 10jenkins-bot: Add Android site association file. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736042 (https://phabricator.wikimedia.org/T294776) (owner: 10Dbrant) [18:13:07] 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10RobH) [18:13:14] 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10RobH) [18:13:19] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [18:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:26] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, 10Discovery-Search (Current work): Resolve kernel hang on wcqs* instances - https://phabricator.wikimedia.org/T294961 (10EBernhardson) [18:13:27] Pchelolo: it looks you left the stagging dir in a weird case. Git status says `Your branch is behind 'origin/master' by 1 commit, and can be fast-forwarded.` [18:13:31] can you fix that please? [18:13:42] urbanecm: oh dang. not done. [18:13:50] forgot the rebase! [18:13:54] *in a weird state [18:14:04] okay, ping me when done please, and note dbrant's patch is merged now :) [18:14:08] urbanecm: fabulous, thanks! 
[18:14:22] dbrant: don't thank me just yet, it's merged, but not yet deployed :-) [18:14:25] stay on line please :) [18:14:51] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, 10Discovery-Search (Current work): Resolve kernel hang on wcqs* instances - https://phabricator.wikimedia.org/T294961 (10EBernhardson) [18:15:07] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' . [18:15:07] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' . [18:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:17] !log ppchelko@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:710126|Clean up temporary variable wgMathUseRestBase (T274436)]] (duration: 01m 02s) [18:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:20] T274436: Enable RESTbaseless validation in wikibase - https://phabricator.wikimedia.org/T274436 [18:15:46] 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10RobH) a:03jijiki @jijiki, The order task T291998#7461853 didn't have racking details, and it seems you are best to provide this for the servers before they arrive? Addition... [18:15:52] ok. done now for sure. [18:15:58] sorry about this urbanecm [18:16:00] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'toolhub' for release 'main' . [18:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:02] np [18:16:03] thanks [18:16:59] dbrant: your patch is at mwdebug1001. Can you test please? 
[18:17:36] (see https://wikitech.wikimedia.org/wiki/WikimediaDebug#Browser_usage for how to do it; feel free to ask if anything's unclear) [18:17:43] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [18:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:23] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' . [18:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:25] RECOVERY - MariaDB Replica IO: s8 on db2081 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:20:03] dbrant: are you still here, please? :-) [18:20:10] urbanecm: testing... [18:20:12] thanks [18:20:45] urbanecm: ok, looks good! [18:20:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:50] thanks, syncing [18:21:08] also looks good on my end [18:22:03] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: keepalived: set same priority on the 2 VRRP instances [puppet] - 10https://gerrit.wikimedia.org/r/736548 (https://phabricator.wikimedia.org/T294956) [18:22:16] !log urbanecm@deploy1002 Synchronized docroot/wikipedia.org/: 2331d061b95ba3fc4de8844008fac93ce18f9063: Add Android site association file (T294776) (duration: 01m 02s) [18:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:19] T294776: Create and host assetlinks.json file. (Android 12 deeplinking support) - https://phabricator.wikimedia.org/T294776 [18:22:23] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'recommendation-api' for release 'production' . 
[18:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:03] dbrant: should be live now [18:23:06] anything else? [18:24:14] !log rebooting ganeti-test2002 with fixed /etc/network/interfaces [18:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:43] (03CR) 10Ottomata: "Ya change it where Ben pointed to, the profile one will override this default value in the module." [puppet] - 10https://gerrit.wikimedia.org/r/736503 (https://phabricator.wikimedia.org/T292087) (owner: 10Razzi) [18:25:02] tgr: hey, are you around? [18:25:13] (03CR) 10Nskaggs: Add further engineering managers for ops: approval (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736451 (owner: 10Muehlenhoff) [18:25:15] 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10Joe) @RobH @jijiki is on PTO at the moment. [18:25:16] o/ [18:26:16] tgr: can you self-service, please? [18:26:21] sure [18:26:31] so, all yours :) [18:26:35] 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10Joe) a:05jijiki→03Joe [18:27:12] urbanecm: i'm not seeing that file live yet [18:27:13] (03PS1) 10MMandere: icinga: Add MMandere to icinga contacts/auth [puppet] - 10https://gerrit.wikimedia.org/r/736549 (https://phabricator.wikimedia.org/T281344) [18:27:24] dbrant: where are you looking for it? [18:27:52] (03CR) 10Legoktm: "We already have hieradata/common/role/common/lists.yaml, why is this in a different file? Have we been doing it wrong?" 
[labs/private] - 10https://gerrit.wikimedia.org/r/736209 (owner: 10Giuseppe Lavagetto) [18:28:06] it appears to work from my end https://www.irccloud.com/pastebin/DyNKW5cw/ [18:28:09] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto) I migrated all services in `staging-codfw` to helm3 using the snippet https://phabricator.wikimedia.org/P17670. I fixed minor issues in the charts but the general process was quit... [18:28:21] (03PS2) 10Gergő Tisza: Enable GrowthExperiments image recommendations on ar,bn,cs,vi [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736320 (https://phabricator.wikimedia.org/T294878) [18:28:49] cached 404s? [18:29:36] might be [18:29:48] (03CR) 10Ssingh: [C: 03+1] icinga: Add MMandere to icinga contacts/auth [puppet] - 10https://gerrit.wikimedia.org/r/736549 (https://phabricator.wikimedia.org/T281344) (owner: 10MMandere) [18:30:15] (03PS3) 10Legoktm: lists: allow banning requests using a hiera array [puppet] - 10https://gerrit.wikimedia.org/r/736211 (owner: 10Giuseppe Lavagetto) [18:30:24] (03PS3) 10Milimetric: role::common::aqs: update mw history in both places [puppet] - 10https://gerrit.wikimedia.org/r/736542 [18:30:45] (03CR) 10Gergő Tisza: [C: 03+2] Enable GrowthExperiments image recommendations on ar,bn,cs,vi [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736320 (https://phabricator.wikimedia.org/T294878) (owner: 10Gergő Tisza) [18:30:50] (03CR) 10Vgutierrez: [C: 03+1] icinga: Add MMandere to icinga contacts/auth [puppet] - 10https://gerrit.wikimedia.org/r/736549 (https://phabricator.wikimedia.org/T281344) (owner: 10MMandere) [18:31:05] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:31:29] i'm getting a couple of 301s, then a 404 [18:31:46] (03Merged) 10jenkins-bot: 
Enable GrowthExperiments image recommendations on ar,bn,cs,vi [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736320 (https://phabricator.wikimedia.org/T294878) (owner: 10Gergő Tisza) [18:31:57] (03CR) 10MMandere: [C: 03+2] icinga: Add MMandere to icinga contacts/auth [puppet] - 10https://gerrit.wikimedia.org/r/736549 (https://phabricator.wikimedia.org/T281344) (owner: 10MMandere) [18:33:19] dbrant: on which URI please? :) [18:33:46] the same one from your paste: https://wikipedia.org/.well-known/assetlinks.json [18:34:08] lemme try to purge it [18:34:12] (from the frontend caches) [18:34:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:38] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:34:50] !log Purge https://en.wikipedia.org/.well-known/assetlinks.json, https://www.wikipedia.org/.well-known/assetlinks.json and https://wikipedia.org/.well-known/assetlinks.json (T294776) [18:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:53] dbrant: can you try now? [18:34:53] T294776: Create and host assetlinks.json file. (Android 12 deeplinking support) - https://phabricator.wikimedia.org/T294776 [18:35:28] (03CR) 10Ryan Kemper: [C: 03+1] "PCC looks as expected: https://puppet-compiler.wmflabs.org/compiler1002/1042/elastic1049.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/721364 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [18:36:18] urbanecm: works! [18:36:22] great! 
[18:36:28] thx [18:36:47] (Juniper alarm active) firing: Juniper alarm active - https://alerts.wikimedia.org [18:36:55] * urbanecm upgrades his Google Pixel to android 12 [18:37:07] 10SRE, 10ops-eqiad: eqiad: add VC-links IDs to Netbox - https://phabricator.wikimedia.org/T268750 (10Cmjohnson) 05Open→03Resolved rows C and D completed. resolved all cable warnings/errors for eqiad [18:38:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:32] (03CR) 10Legoktm: [C: 03+2] lists: allow banning requests using a hiera array (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736211 (owner: 10Giuseppe Lavagetto) [18:40:44] !log re-enabling puppet on lists1001 [18:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:01] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:46:26] (03PS1) 10Legoktm: lists: Set profile::base::firewall::block_abuse_nets: true [puppet] - 10https://gerrit.wikimedia.org/r/736552 [18:47:19] (03CR) 10Legoktm: lists: allow banning requests using a hiera array (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736211 (owner: 10Giuseppe Lavagetto) [18:47:59] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:48:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[18:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:05] (03CR) 10Ladsgroup: [C: 03+1] lists: Set profile::base::firewall::block_abuse_nets: true [puppet] - 10https://gerrit.wikimedia.org/r/736552 (owner: 10Legoktm) [18:51:07] (03CR) 10Razzi: [C: 03+2] role::common::aqs: update mw history in both places [puppet] - 10https://gerrit.wikimedia.org/r/736542 (owner: 10Milimetric) [18:51:20] (03PS1) 10Gergő Tisza: Use $wgGEImageRecommendationServiceUseTitles on Growth Add Image wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736553 [18:51:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:33] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4035.ulsfo.wmnet with OS buster [18:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:41] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4035.ulsfo.wmnet with OS buster completed: - cp4035 (**WARN**... [18:55:13] (03CR) 10Gergő Tisza: [C: 03+2] Use $wgGEImageRecommendationServiceUseTitles on Growth Add Image wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736553 (owner: 10Gergő Tisza) [18:55:30] !log razzi@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. 
- razzi@cumin1001 [18:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:17] (03Merged) 10jenkins-bot: Use $wgGEImageRecommendationServiceUseTitles on Growth Add Image wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736553 (owner: 10Gergő Tisza) [18:59:45] !log razzi@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. - razzi@cumin1001 [18:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:05] dduvall and twentyafterfour: That opportune time is upon us again. Time for a MediaWiki train - Utc-7 Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211103T1900). [19:00:22] dduvall: twentyafterfour I'm running into the window a bit [19:00:43] tgr: no problem. i'm in a meeting that's running late :) [19:01:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:08] 10ops-eqiad, 10DC-Ops, 10SRE Observability (FY2021/2022-Q2): (Need By: TBD) rack/setup/install prometheus100[56] - https://phabricator.wikimedia.org/T294967 (10RobH) [19:02:34] 10ops-eqiad, 10DC-Ops, 10SRE Observability (FY2021/2022-Q2): (Need By: TBD) rack/setup/install prometheus100[56] - https://phabricator.wikimedia.org/T294967 (10RobH) [19:03:41] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is CRITICAL: 1.206e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [19:03:48] (03PS1) 10Jgiannelos: tegola-vector-tiles: Setup cronjob parallelism [deployment-charts] - 10https://gerrit.wikimedia.org/r/736554 (https://phabricator.wikimedia.org/T293366) [19:04:34] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:04:35] (03PS5) 10Ahmon Dancy: php-fpm: Add settings to control debuggability [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/732737 [19:05:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:02] (03CR) 10Ahmon Dancy: [C: 03+1] php-fpm: Add settings to control debuggability (035 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/732737 (owner: 10Ahmon Dancy) [19:07:28] (03PS1) 10Gergő Tisza: Revert GrowthExperiments Add Image pilot deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736555 (https://phabricator.wikimedia.org/T294878) [19:08:35] (03CR) 10Gergő Tisza: [C: 03+2] Revert GrowthExperiments Add Image pilot deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736555 (https://phabricator.wikimedia.org/T294878) (owner: 10Gergő Tisza) [19:08:54] 10SRE, 10Security: apache modsec rules deployment with scap - https://phabricator.wikimedia.org/T224887 (10Reedy) [19:09:21] (03Merged) 10jenkins-bot: Revert GrowthExperiments Add Image pilot deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736555 (https://phabricator.wikimedia.org/T294878) (owner: 10Gergő Tisza) [19:09:28] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is CRITICAL: 1.206e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [19:10:28] !log UTC evening deploys done [19:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:47] dduvall: I'm done [19:10:55] excellent. thank you [19:12:39] doing group0 first today, then group1 after... 
let's say 30 min-ish of clear logs [19:13:10] (03PS1) 10Dduvall: group0 wikis to 1.38.0-wmf.7 refs T293948 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736558 [19:13:12] (03CR) 10Dduvall: [C: 03+2] group0 wikis to 1.38.0-wmf.7 refs T293948 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736558 (owner: 10Dduvall) [19:13:55] (03Merged) 10jenkins-bot: group0 wikis to 1.38.0-wmf.7 refs T293948 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736558 (owner: 10Dduvall) [19:14:14] (03PS1) 10RLazarus: admin: Add cmelo and jcarvalho to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/736559 (https://phabricator.wikimedia.org/T294927) [19:15:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:21] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.38.0-wmf.7 refs T293948 [19:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:23] T293948: 1.38.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T293948 [19:17:18] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32110/console" [puppet] - 10https://gerrit.wikimedia.org/r/736559 (https://phabricator.wikimedia.org/T294927) (owner: 10RLazarus) [19:17:56] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:18:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:08] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, 10Discovery-Search (Current work): Resolve kernel hang on wcqs* instances - https://phabricator.wikimedia.org/T294961 (10EBernhardson) Some random info i looked up: * grafana reports free memory of at least 90G across instances. This is typical,... [19:19:50] !log 1.38.0-wmf.7 now on group0. no new errors. leaving ~ 30 minutes before promoting group1 (T293948) [19:19:52] (03CR) 10RLazarus: [V: 03+1 C: 03+2] admin: Add cmelo and jcarvalho to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/736559 (https://phabricator.wikimedia.org/T294927) (owner: 10RLazarus) [19:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:34] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:26:14] !log pool cp4035.ulsfo.wmnet - T290694 [19:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:18] T290694: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 [19:28:10] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for JCarvalho - https://phabricator.wikimedia.org/T294929 (10RLazarus) @JCarvalho Welcome to the Foundation! I've added you to the wmf LDAP group, ` rzl@mwmaint1002:~$ ldapsearch -x cn=wmf | grep jcarvalho member: uid=jcarvalho,ou=people... [19:28:23] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for CMelo - https://phabricator.wikimedia.org/T294927 (10RLazarus) 05Open→03Resolved a:03RLazarus @CMelo Welcome to the Foundation! I've added you to the wmf LDAP group, ` rzl@mwmaint1002:~$ ldapsearch -x cn=wmf | grep cmelo member... 
[19:28:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:58] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:29:27] 10SRE-Access-Requests: Requesting access to restricted for htriedman - https://phabricator.wikimedia.org/T294970 (10Htriedman) [19:29:28] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for JCarvalho - https://phabricator.wikimedia.org/T294929 (10RLazarus) 05Open→03Resolved a:03RLazarus [19:29:52] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 113 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:30:57] dduvall: fyi ^ [19:31:06] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 18 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:31:17] (not sure if train-related) [19:32:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:11] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=wcqs2003.codfw.wmnet [19:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:27] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=wcqs2003.codfw.wmnet [19:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:24] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:35:50] !log depooled wcqs2003 (pooled=inactive) because Icinga alerts that servers are down but pooled. not in production yet but issues (T294961) [19:35:50] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:53] T294961: Resolve kernel hang on wcqs* instances - https://phabricator.wikimedia.org/T294961 [19:37:21] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/736552 (owner: 10Legoktm) [19:37:28] rzl: there's usually a spike in timeouts related to train but typically it subsides by now [19:37:48] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:37:50] execution time limits exceeded that is [19:41:17] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.67:443]) https://wikitech.wikimedia.org/wiki/PyBal [19:41:31] PROBLEM - PyBal IPVS diff check 
on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.67:443]) https://wikitech.wikimedia.org/wiki/PyBal [19:41:58] hrmmm.. So I just depooled the broken wcqs servers to fix these alerts [19:41:58] mutante: that's not your depool is it ^ [19:42:03] then rescheduled them and now this [19:42:07] I think you depooled them [19:42:15] yea, exactly [19:42:16] And it's ended up with 0 pooled servers [19:42:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:36] So it's unhappy because 0 servers means the service doesn't exist according to IPVS [19:42:36] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:42:50] And pybal thinks it should, mutante [19:42:52] dduvall: ah cool [19:43:56] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=wcqs2003.codfw.wmnet [19:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:14] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 13 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [19:46:07] RhinosF1: doing pooled=no vs pooled=inactive [19:46:25] but normally it would have to be inactive if it will be down for weeks [19:46:54] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for htriedman - https://phabricator.wikimedia.org/T294970 (10RLazarus) [19:47:32] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-access for NRodriguez -
https://phabricator.wikimedia.org/T291508 (10NRodriguez) Apologies for the delay! And thanks so much for moving quickly on this. Here is my key, generated with the second command as a `.pub` file > ssh-rsa AAAAB3N... [19:48:07] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-access for NRodriguez - https://phabricator.wikimedia.org/T291508 (10NRodriguez) a:05NRodriguez→03None [19:48:14] mutante: ok [19:48:18] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-access for NRodriguez - https://phabricator.wikimedia.org/T291508 (10NRodriguez) a:03Dzahn [19:48:38] * RhinosF1 doesn't understand it well enough to say any more [19:49:10] RhinosF1: "no" means "is in config but gets no traffic". inactive means it's not in the config at all [19:49:20] Ah right [19:49:25] I assume no is fine [19:49:32] They're not doing anything [19:49:37] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10RobH) [19:49:41] Wcqs isn't ready yet [19:49:41] (03PS1) 10Dduvall: group1 wikis to 1.38.0-wmf.7 refs T293948 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736561 [19:49:43] (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.38.0-wmf.7 refs T293948 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736561 (owner: 10Dduvall) [19:49:51] yea, they do less than nothing actually [19:49:58] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 105 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:50:03] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10RobH) [19:50:30] (03Merged) 10jenkins-bot: group1 wikis to 1.38.0-wmf.7 refs T293948
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/736561 (owner: 10Dduvall) [19:50:46] mutante: apart from alert a lot [19:51:33] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for htriedman - https://phabricator.wikimedia.org/T294970 (10Urbanecm) Hello everyone, I'm not sure why this is a request to `restricted`. That user group is normally used for people to run maintenance queries or (write) queries on production databa... [19:51:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:46] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.7 refs T293948 [19:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:48] T293948: 1.38.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T293948 [19:51:52] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:52:49] !log dduvall@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.7 refs T293948 (duration: 01m 03s) [19:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:18] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for htriedman - https://phabricator.wikimedia.org/T294970 (10Urbanecm) Actually...htriedman appears to already //be// in the analytics-privatedata-users group (albeit with a different SSH key), so I don't think anything's needed here. @Htriedman Ca... 
[19:55:48] 10SRE, 10MediaWiki-General, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.28; 2020-04-14), and 5 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Urbanecm) @Pchelolo Ping? Can you please review the renames that failed?... [19:55:51] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for htriedman - https://phabricator.wikimedia.org/T294970 (10RLazarus) @Htriedman Hi, I can get you set up here! I see you're already a member of `analytics_privatedata_users` but with a different SSH key -- assuming that you can still use that key,... [19:56:30] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for htriedman - https://phabricator.wikimedia.org/T294970 (10RLazarus) Ah sorry, crossed in-flight -- @Htriedman please go ahead with @Urbanecm's request and let us know how it goes. [19:58:39] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2008 - https://phabricator.wikimedia.org/T294973 (10RobH) [19:58:48] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2008 - https://phabricator.wikimedia.org/T294973 (10RobH) [19:59:01] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-access for NRodriguez - https://phabricator.wikimedia.org/T291508 (10Dzahn) 05Stalled→03Open [19:59:34] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:00:05] dduvall and twentyafterfour: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211103T1900). [20:00:05] chrisalbon and accraze: I, the Bot under the Fountain, call upon thee, The Deployer, to do Services – Graphoid / ORES deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211103T2000). [20:00:15] 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup2008 - https://phabricator.wikimedia.org/T294973 (10RobH) [20:00:46] urbanecm: just saw the task, thank you. rzl is on clinic duty this week, so I will be reassigning it to them, if that's ok [20:01:06] sukhe: absolutely! I just followed the topic in this chan [20:01:19] not sure if either sukhe or rzl can update it, but if not, let me know, and I can :) [20:01:43] ah yes. someone should update the topic as well! [20:02:01] sorry for the ping, i was 100% convinced you're on duty today hehe :) [20:02:26] no problem! [20:03:34] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:03:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:20] (03PS1) 10Herron: centrallog: prep rsync from centrallog2001 -> centrallog2002 [puppet] - 10https://gerrit.wikimedia.org/r/736563 [20:05:18] urbanecm: thanks! [20:05:22] (03PS2) 10Herron: centrallog: prep rsync from centrallog2001 -> centrallog2002 [puppet] - 10https://gerrit.wikimedia.org/r/736563 [20:05:47] rzl: no problem :) [20:06:03] (although I'm not 100% sure what i'm being thanked for, if the T294970 comments or topic :D) [20:06:22] ahaha, one and then the other :D [20:06:40] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:06:49] even better! :) [20:06:58] glad i can help, as always. 
[20:07:07] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 237, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:07:20] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/736563 (owner: 10Herron) [20:07:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:34] (03PS3) 10Herron: centrallog: prep rsync from centrallog2001 -> centrallog2002 [puppet] - 10https://gerrit.wikimedia.org/r/736563 [20:09:15] 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup1008 - https://phabricator.wikimedia.org/T294974 (10RobH) [20:10:18] 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup1008 - https://phabricator.wikimedia.org/T294974 (10RobH) [20:15:54] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:17:00] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:17:26] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:18:00] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:21:04] (03CR) 10Ottomata: [V: 03+1] Add role::analytics_cluster::database::meta on an-db100[12] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736019 (https://phabricator.wikimedia.org/T284150) (owner: 10Ottomata) [20:21:17] (03PS10) 10Ottomata: Add role::analytics_cluster::database::meta on an-db100[12] [puppet] - 10https://gerrit.wikimedia.org/r/736019 (https://phabricator.wikimedia.org/T284150) [20:21:40] (03PS11) 10Ottomata: Add role::analytics_cluster::database::meta on an-db100[12] [puppet] - 10https://gerrit.wikimedia.org/r/736019 (https://phabricator.wikimedia.org/T284150) [20:26:47] (03CR) 10Krinkle: [C: 03+1] Disable DPL on wikimania2016wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710482 (https://phabricator.wikimedia.org/T287916) (owner: 10Ladsgroup) [20:36:22] (03PS12) 10Ottomata: Add role::analytics_cluster::database::meta on an-db100[12] [puppet] - 10https://gerrit.wikimedia.org/r/736019 (https://phabricator.wikimedia.org/T284150) [20:37:29] 10SRE, 10ops-codfw: mw2280 unresponsive to powercycle and hardreset - https://phabricator.wikimedia.org/T290708 (10wiki_willy) a:03Papaul [20:37:35] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for htriedman - https://phabricator.wikimedia.org/T294970 (10Jcross) @RLazarus You have my approval [20:38:39] 10SRE, 10ops-codfw: mw2280 unresponsive to powercycle and hardreset - https://phabricator.wikimedia.org/T290708 (10wiki_willy) Just a quick update - Wolfgang requested that we buy a replacement server for this, so currently working on getting the budget approval to get it procured. 
Thanks, Willy [20:39:19] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32111/console" [puppet] - 10https://gerrit.wikimedia.org/r/736019 (https://phabricator.wikimedia.org/T284150) (owner: 10Ottomata) [20:41:33] (03PS13) 10Ottomata: Add role::analytics_cluster::database::meta on an-db100[12] [puppet] - 10https://gerrit.wikimedia.org/r/736019 (https://phabricator.wikimedia.org/T284150) [20:47:25] (03PS14) 10Ottomata: Add role::analytics_cluster::database::meta on an-db100[12] [puppet] - 10https://gerrit.wikimedia.org/r/736019 (https://phabricator.wikimedia.org/T284150) [20:49:43] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32112/console" [puppet] - 10https://gerrit.wikimedia.org/r/736019 (https://phabricator.wikimedia.org/T284150) (owner: 10Ottomata) [20:53:44] (03PS1) 10Ryan Kemper: wcqs: state change production->monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/736564 (https://phabricator.wikimedia.org/T294865) [20:55:48] (03CR) 10Ebernhardson: [C: 03+1] wcqs: state change production->monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/736564 (https://phabricator.wikimedia.org/T294865) (owner: 10Ryan Kemper) [20:55:51] !log upgrading PHP 7.2 on A:parsoid [20:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:56] (03CR) 10Ottomata: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/32112/an-db1001.eqiad.wmnet/fulldiff.html and https://puppet-compiler.wmflabs.org/compile" [puppet] - 10https://gerrit.wikimedia.org/r/736019 (https://phabricator.wikimedia.org/T284150) (owner: 10Ottomata) [20:57:25] (03PS15) 10Ottomata: Add role::analytics_cluster::database::meta on an-db100[12] [puppet] - 10https://gerrit.wikimedia.org/r/736019 (https://phabricator.wikimedia.org/T284150) [20:58:53] (Processor usage over 85%) firing: Processor usage over 85% 
- https://alerts.wikimedia.org [21:00:10] !log upgrading PHP 7.2 on A:snapshot [21:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:31] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-access for NRodriguez - https://phabricator.wikimedia.org/T291508 (10RLazarus) [21:03:39] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-access for NRodriguez - https://phabricator.wikimedia.org/T291508 (10RLazarus) a:05Dzahn→03RLazarus @NRodriguez Thanks! The only thing left is manager signoff, then I can go ahead and make the change. @DannyH Can you please comment h... [21:05:50] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, 10Patch-For-Review: wcqs1002 and wcqs2001 unresponsive - https://phabricator.wikimedia.org/T294865 (10RKemper) [21:06:04] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, 10Discovery-Search (Current work): Resolve kernel hang on wcqs* instances - https://phabricator.wikimedia.org/T294961 (10RKemper) [21:06:10] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, 10Patch-For-Review: wcqs1002 and wcqs2001 unresponsive - https://phabricator.wikimedia.org/T294865 (10RKemper) [21:08:14] (03PS1) 10PipelineBot: image-suggestion-api: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/736565 [21:12:07] !log upgrading PHP 7.2 on labweb, deployment-servers [21:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:33] uh think I broke wikitech one sec [21:15:36] (03PS1) 10Andrew Bogott: cookbook sre: update SREBatchBase/SREBatchRunnerBase with minor fixes [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736568 [21:15:38] (03PS1) 10Andrew Bogott: cookbooks sre: update run_scripts to accept a list of scripts not functions [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736569 [21:15:40] (03PS1) 10Andrew Bogott: cookbooks.sre: update to use correct icinga_hosts instance [cookbooks] (wmcs) - 
10https://gerrit.wikimedia.org/r/736570 [21:15:42] (03PS1) 10Andrew Bogott: sre: add conftool aware SREBatchRunnerBase [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736571 [21:15:44] (03PS1) 10Andrew Bogott: sre.hosts.reimage: don't fail on new DC [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736572 [21:15:46] (03PS1) 10Andrew Bogott: upgrade-varnish: support frontend instance only [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736573 [21:15:48] (03PS1) 10Andrew Bogott: cookbook sre.dns.wipe-cache: cookbook to clear stale DNS entries [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736574 [21:15:50] (03PS1) 10Andrew Bogott: sre.hosts.reimage: handle switches without virtual chassis [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736575 (https://phabricator.wikimedia.org/T284471) [21:15:54] (03PS1) 10Andrew Bogott: sre.hosts.ipmi-password-reset: support new hosts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736576 [21:15:56] (03PS1) 10Andrew Bogott: sre.hosts.reimage: Fix hostname on example [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736577 [21:15:58] (03PS1) 10Andrew Bogott: sre.hosts.reimage: adapt confctl message [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736578 [21:16:00] (03PS1) 10Andrew Bogott: sre.ganeti.makevm: Relax globbing for interface name used in bridges [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736579 [21:16:02] (03PS1) 10Andrew Bogott: start_instance_with_prefix: return (id, fqdn) rather than just fqdn [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736580 [21:16:20] andrewbogott, bd808: are both labweb1001 and labweb1002 supposed to be pooled? or just one? 
[21:16:24] to serve wikitech [21:16:35] (03Abandoned) 10Andrew Bogott: start_instance_with_prefix: return (id, fqdn) rather than just fqdn [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736580 (owner: 10Andrew Bogott) [21:16:39] (03Abandoned) 10Andrew Bogott: sre.ganeti.makevm: Relax globbing for interface name used in bridges [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736579 (owner: 10Andrew Bogott) [21:16:40] legoktm: I thought they were active/active [21:16:44] (03Abandoned) 10Andrew Bogott: sre.hosts.reimage: adapt confctl message [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736578 (owner: 10Andrew Bogott) [21:16:48] (03Abandoned) 10Andrew Bogott: sre.hosts.reimage: Fix hostname on example [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736577 (owner: 10Andrew Bogott) [21:16:52] (03Abandoned) 10Andrew Bogott: sre.hosts.ipmi-password-reset: support new hosts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736576 (owner: 10Andrew Bogott) [21:16:55] (03Abandoned) 10Andrew Bogott: sre.hosts.reimage: handle switches without virtual chassis [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736575 (https://phabricator.wikimedia.org/T284471) (owner: 10Andrew Bogott) [21:17:00] (03Abandoned) 10Andrew Bogott: cookbook sre.dns.wipe-cache: cookbook to clear stale DNS entries [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736574 (owner: 10Andrew Bogott) [21:17:04] (03Abandoned) 10Andrew Bogott: upgrade-varnish: support frontend instance only [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736573 (owner: 10Andrew Bogott) [21:17:04] ok [21:17:07] fixed [21:17:08] (03Abandoned) 10Andrew Bogott: sre.hosts.reimage: don't fail on new DC [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736572 (owner: 10Andrew Bogott) [21:17:15] (03Abandoned) 10Andrew Bogott: sre: add conftool aware SREBatchRunnerBase [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736571 (owner: 10Andrew Bogott) [21:17:19] (03Abandoned) 10Andrew 
Bogott: cookbooks.sre: update to use correct icinga_hosts instance [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736570 (owner: 10Andrew Bogott) [21:17:23] (03Abandoned) 10Andrew Bogott: cookbooks sre: update run_scripts to accept a list of scripts not functions [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736569 (owner: 10Andrew Bogott) [21:17:28] (03Abandoned) 10Andrew Bogott: cookbook sre: update SREBatchBase/SREBatchRunnerBase with minor fixes [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736568 (owner: 10Andrew Bogott) [21:17:44] I screwed up while running the restart-php7.2-fpm script, and it depooled both from LVS at the same time [21:18:02] s/LVS/conftool [21:18:17] such things happen [21:19:01] (03PS1) 10Andrew Bogott: start_instance_with_prefix: return (id, fqdn) rather than just fqdn [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736581 [21:22:08] legoktm: yes, both pooled [21:22:10] in theory [21:22:28] they are both pooled now! [21:23:31] (03CR) 10jerkins-bot: [V: 04-1] start_instance_with_prefix: return (id, fqdn) rather than just fqdn [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736581 (owner: 10Andrew Bogott) [21:26:26] (03PS2) 10Ryan Kemper: wcqs: state change production->lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/736564 (https://phabricator.wikimedia.org/T294865) [21:26:37] !log upgrading/restarting apache2 on A:all-mw-codfw [21:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:13] (03PS3) 10Ryan Kemper: wcqs: state change production->lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/736564 (https://phabricator.wikimedia.org/T294961) [21:32:20] (03PS1) 10Ryan Kemper: Revert "wcqs: add discovery record" [dns] - 10https://gerrit.wikimedia.org/r/736585 (https://phabricator.wikimedia.org/T294961) [21:32:45] (03PS2) 10Ryan Kemper: Revert "wcqs: add discovery record" [dns] - 10https://gerrit.wikimedia.org/r/736585 (https://phabricator.wikimedia.org/T294961) 
[21:33:40] (03PS3) 10Ryan Kemper: Revert "wcqs: add discovery record" [dns] - 10https://gerrit.wikimedia.org/r/736585 (https://phabricator.wikimedia.org/T294961) [21:34:25] (03PS4) 10Ryan Kemper: wcqs: state change production->lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/736564 (https://phabricator.wikimedia.org/T294961) [21:38:33] !log upgrading/restarting apache2 on A:all-mw-eqiad [21:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:57] (03PS2) 10Andrew Bogott: start_instance_with_prefix: return (id, fqdn) rather than just fqdn [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736581 [21:40:05] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, and 2 others: Resolve kernel hang on wcqs* instances - https://phabricator.wikimedia.org/T294961 (10EBernhardson) [21:42:41] (03CR) 10RLazarus: [C: 03+1] Revert "wcqs: add discovery record" [dns] - 10https://gerrit.wikimedia.org/r/736585 (https://phabricator.wikimedia.org/T294961) (owner: 10Ryan Kemper) [21:44:07] (03CR) 10Ryan Kemper: [C: 03+2] Revert "wcqs: add discovery record" [dns] - 10https://gerrit.wikimedia.org/r/736585 (https://phabricator.wikimedia.org/T294961) (owner: 10Ryan Kemper) [21:45:46] !log T294961 [WCQS] Merged https://gerrit.wikimedia.org/r/c/operations/dns/+/736585, running `ryankemper@authdns1001:~$ sudo -i authdns-update` [21:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:50] T294961: Resolve kernel hang on wcqs* instances - https://phabricator.wikimedia.org/T294961 [21:47:33] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, and 2 others: Resolve kernel hang on wcqs* instances - https://phabricator.wikimedia.org/T294961 (10RKemper) ### DNS changes from https://gerrit.wikimedia.org/r/c/operations/dns/+/736585 ` ryankemper@authdns1001:~$ sudo -i authdns-update Updating au... 
[21:47:45] !log T294961 [WCQS] DNS changes rolled out, proceeding to the `lvs_setup` step: https://gerrit.wikimedia.org/r/c/operations/puppet/+/736564 [21:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:55] (03CR) 10Ryan Kemper: [C: 03+2] wcqs: state change production->lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/736564 (https://phabricator.wikimedia.org/T294961) (owner: 10Ryan Kemper) [21:51:11] 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 (2.44.10) - https://phabricator.wikimedia.org/T193352 (10JoKalliauer) [21:51:54] 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to > 2.44.10 - https://phabricator.wikimedia.org/T265549 (10JoKalliauer) [21:51:56] 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947 (10JoKalliauer) [21:51:58] 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 (2.44.10) - https://phabricator.wikimedia.org/T193352 (10JoKalliauer) [21:52:08] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for htriedman - https://phabricator.wikimedia.org/T294970 (10Htriedman) Hi @Urbanecm, thanks for the quick response and the helpful pointer. I've been able to get into `centralauth` by running `analytics-mysql centralauth`, and can query `centralaut... 
[21:53:32] !log T294961 [WCQS] Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/736564 and successfully ran `ryankemper@cumin1001:~$ sudo cumin 'A:icinga or A:dns-auth' run-puppet-agent` [21:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:35] T294961: Resolve kernel hang on wcqs* instances - https://phabricator.wikimedia.org/T294961 [21:54:11] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for htriedman - https://phabricator.wikimedia.org/T294970 (10Urbanecm) 05Open→03Declined Let's call this declined then :). It's usually better to use analytics-related privs for research purposes. As I said above, with mwmaint access, you can a... [21:54:52] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, and 2 others: Resolve kernel hang on wcqs* instances - https://phabricator.wikimedia.org/T294961 (10RKemper) #### Puppet changes ` ryankemper@puppetmaster1001:~$ sudo puppet-merge Fetching new commits from: https://gerrit.wikimedia.org/r/labs/priva... [21:56:01] !log T294961 [WCQS] Forcing recheck of `PyBal IPVS diff check` and `PyBal backends health check` [21:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:25] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for htriedman - https://phabricator.wikimedia.org/T294970 (10Htriedman) Totally understand. Thanks for the tips! [22:00:56] 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947 (10JoKalliauer) Another issue reported downstream: https://commons.wikimedia.org/wiki/User_talk:Chen-Pan_Liao#Your_File:R... 
[22:08:45] (03PS1) 10Bartosz Dziewoński: Make reply tool available as opt-out on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736588 (https://phabricator.wikimedia.org/T294591) [22:12:37] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=wcqs2001.codfw.wmnet [22:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:45] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=wcqs2002.codfw.wmnet [22:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:43] jouncebot: next [22:23:43] In 0 hour(s) and 36 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211103T2300) [22:24:20] !log restarting phabricator to apply updates. [22:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:53] !log restarted php7.3-fpm on phab1001 [22:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:40] (03PS1) 10Andrew Bogott: wmcs-enc-cli: added set_prefix_roles [puppet] - 10https://gerrit.wikimedia.org/r/736590 [22:26:24] (03PS2) 10Andrew Bogott: wmcs-enc-cli: added set_prefix_roles subcommand [puppet] - 10https://gerrit.wikimedia.org/r/736590 [22:32:40] !log uploaded scap 4.0.3 to apt.wm.o for buster and stretch (T294966) [22:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:44] T294966: Deploy Scap version 4.0.3 - https://phabricator.wikimedia.org/T294966 [22:34:20] !log upgraded apache2 on lists1001 [22:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:49] !log upgrading scap on canaries (T294966) [22:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:52] T294966: Deploy Scap version 4.0.3 - https://phabricator.wikimedia.org/T294966 [22:41:26] (03PS1) 10Gergő Tisza: Add Image: add HTTP proxy config [extensions/GrowthExperiments] 
(wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/736518 (https://phabricator.wikimedia.org/T290949) [22:42:07] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.67:443]) daniel_zahn pybal restart - setup pending https://wikitech.wikimedia.org/wiki/PyBal [22:42:08] ACKNOWLEDGEMENT - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wcqs_443: Servers wcqs2003.codfw.wmnet are marked down but pooled daniel_zahn pybal restart - setup pending https://wikitech.wikimedia.org/wiki/PyBal [22:42:08] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.67:443]) daniel_zahn pybal restart - setup pending https://wikitech.wikimedia.org/wiki/PyBal [22:42:08] ACKNOWLEDGEMENT - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wcqs_443: Servers wcqs2003.codfw.wmnet are marked down but pooled daniel_zahn pybal restart - setup pending https://wikitech.wikimedia.org/wiki/PyBal [22:46:38] (03PS1) 10Gergő Tisza: Add Image: Harden API response parsing [extensions/GrowthExperiments] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/736519 [22:47:04] !log upgraded scap on A:restbase (T294936) [22:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:07] T294936: RESTBase deployment fails with scap internal error - https://phabricator.wikimedia.org/T294936 [22:48:32] !log ppchelko@deploy1002 Started deploy [restbase/deploy@664a2f8]: Add new wikis T292422 T294587 T294588 [22:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:36] T292422: Add amiwiki to RESTBase - https://phabricator.wikimedia.org/T292422 [22:48:36] T294587: Add pwnwiki to RESTBase - https://phabricator.wikimedia.org/T294587 [22:48:37] T294588: Add lmowiktionary to RESTBase - https://phabricator.wikimedia.org/T294588 [22:48:42] !log 
ppchelko@deploy1002 Finished deploy [restbase/deploy@664a2f8]: Add new wikis T292422 T294587 T294588 (duration: 00m 10s) [22:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:48] (03PS1) 10Gergő Tisza: Enable GrowthExperiments image recommendations on ar,bn,cs,vi [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736593 (https://phabricator.wikimedia.org/T294878) [22:56:28] (03PS1) 10Dzahn: cumin: add parsoid-canary to mw-canary and reuse other aliases [puppet] - 10https://gerrit.wikimedia.org/r/736594 (https://phabricator.wikimedia.org/T294802) [22:57:27] (03PS2) 10Dzahn: cumin: add parsoid-canary to mw-canary and reuse other aliases [puppet] - 10https://gerrit.wikimedia.org/r/736594 (https://phabricator.wikimedia.org/T294802) [22:58:28] mutante: I realized a few minutes ago that for some reason mw-canary doesn't include mw-jobrunner-canary [22:58:46] legoktm: yea, that is what my comment is about there [22:58:55] ack [22:59:00] as long as "mw" means "appservers only" and not "all of mw" [22:59:09] but I am suggesting another change [22:59:12] to change that part [22:59:26] but that needs more eyes before changing it I think [22:59:35] people might be used to the meaning of "mw" [22:59:42] mhm [22:59:52] while the "add parsoid-canary to other canaries" we can just do [22:59:59] there's also all-mw-{eqiad,codfw} [23:00:05] RoanKattouw and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211103T2300). [23:00:05] tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:00:10] yea, exactly [23:00:35] please don't deploy yet, we're trying to figure out a scap issue [23:01:16] !log legoktm@deploy1002 Started deploy [restbase/deploy@664a2f8]: (no justification provided) [23:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:49] legoktm: worse, kind of bugs me that the puppet role names are: mediawiki::appserver::canary_api but NOT mediawiki::appserver::canary :) [23:02:07] !log legoktm@deploy1002 Finished deploy [restbase/deploy@664a2f8]: (no justification provided) (duration: 00m 50s) [23:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:36] legoktm: ack, please ping me when done [23:09:08] (03PS1) 10Dbrant: Create alias for Android site association file. [puppet] - 10https://gerrit.wikimedia.org/r/736595 (https://phabricator.wikimedia.org/T294776) [23:14:45] I'm rolling back scap, it'll be like 10 minutes [23:16:45] (03PS1) 10Dzahn: cumin: reorganize mediawiki aliases [puppet] - 10https://gerrit.wikimedia.org/r/736596 (https://phabricator.wikimedia.org/T294802) [23:18:51] (03PS3) 10Juan90264: Add Wikivoyage in wgImportSources to enwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736520 (https://phabricator.wikimedia.org/T294928) [23:19:28] (03PS2) 10Dzahn: cumin: reorganize mediawiki aliases [puppet] - 10https://gerrit.wikimedia.org/r/736596 (https://phabricator.wikimedia.org/T294802) [23:20:52] !log uploaded scap 4.0.3-1+really4.0.2 to apt.wm.o for buster/stretch [23:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:19] (03PS2) 10Dbrant: Create alias for Android site association file. 
[puppet] - 10https://gerrit.wikimedia.org/r/736595 (https://phabricator.wikimedia.org/T294776) [23:21:39] PROBLEM - MariaDB Replica IO: s6 on db2141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:22:48] !log reverted canaries back to scap 4.0.2 [23:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:13] tgr: ok, you should be set now, ping me if anything seems off [23:24:15] thanks legoktm! [23:25:21] (03CR) 10Gergő Tisza: [C: 03+2] Add Image: Harden API response parsing [extensions/GrowthExperiments] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/736519 (owner: 10Gergő Tisza) [23:25:39] (03CR) 10Gergő Tisza: [C: 03+2] Add Image: add HTTP proxy config [extensions/GrowthExperiments] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/736518 (https://phabricator.wikimedia.org/T290949) (owner: 10Gergő Tisza) [23:34:21] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1258.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:36:13] PROBLEM - MariaDB Replica Lag: s6 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1257.72 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:40:32] I am not able to access https://phabricator.wikimedia.org and https://gerrit.wikimedia.org, and I was accessing a few minutes before [23:41:10] Leave it to you now :) [23:41:25] (03CR) 10Cwhite: [C: 03+1] centrallog: prep rsync from centrallog2001 -> centrallog2002 [puppet] - 10https://gerrit.wikimedia.org/r/736563 (owner: 10Herron) [23:41:31] (03PS1) 10Dzahn: snapshot: convert 2 crons for full and partial dumps into systemd 
timers [puppet] - 10https://gerrit.wikimedia.org/r/736599 (https://phabricator.wikimedia.org/T273673) [23:42:09] RECOVERY - MariaDB Replica IO: s6 on db2141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:42:15] RECOVERY - MariaDB Replica Lag: s6 on db2141 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:44:30] (03PS2) 10Dzahn: snapshot: convert 2 crons for full and partial dumps into systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/736599 (https://phabricator.wikimedia.org/T273673) [23:46:44] (03CR) 10Dzahn: snapshot: convert 2 crons for full and partial dumps into systemd timers (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/736599 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [23:47:56] (03PS1) 10Dzahn: snapshop: remove absented cron code [puppet] - 10https://gerrit.wikimedia.org/r/736600 (https://phabricator.wikimedia.org/T273673) [23:48:05] (03Merged) 10jenkins-bot: Add Image: add HTTP proxy config [extensions/GrowthExperiments] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/736518 (https://phabricator.wikimedia.org/T290949) (owner: 10Gergő Tisza) [23:48:07] (03Merged) 10jenkins-bot: Add Image: Harden API response parsing [extensions/GrowthExperiments] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/736519 (owner: 10Gergő Tisza) [23:49:26] Any deployers available for the calendar? [23:50:41] Someone? [23:50:56] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/736520 [23:51:18] legoktm: ? [23:51:36] tgr is deploying right now [23:51:44] (03CR) 10Cwhite: P:rsyslog: ship puppetmaster logs to kafka (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/736233 (https://phabricator.wikimedia.org/T222826) (owner: 10Jbond) [23:52:03] Perfect! tgr: Let's start? 
[23:53:06] not much time left, but I don't think anything conflicting is happening afterwards? [23:53:39] Okay [23:54:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:33] (03CR) 10Gergő Tisza: [C: 03+2] Enable GrowthExperiments image recommendations on ar,bn,cs,vi [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736593 (https://phabricator.wikimedia.org/T294878) (owner: 10Gergő Tisza) [23:56:21] (03Merged) 10jenkins-bot: Enable GrowthExperiments image recommendations on ar,bn,cs,vi [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736593 (https://phabricator.wikimedia.org/T294878) (owner: 10Gergő Tisza) [23:57:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log