[00:00:00] !log legoktm@lists1001:~$ sudo rm -rf /etc/mailman # cleanup as part of 4869d91b0be / T282303
[00:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:00:06] T282303: The Great Clean Up of Mailman2 - https://phabricator.wikimedia.org/T282303
[00:00:43] I go back to doing chores, ping me if needed
[00:01:19] bye :) puppet is all happy now, thanks
[00:05:48] PROBLEM - Check systemd state on elastic2048 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:08:48] PROBLEM - Check systemd state on an-tool1005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:13:13] SRE, ops-eqiad, DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (RobH)
[00:21:59] SRE, ops-eqiad, DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (RobH) So I'm updating the firmware and I've applied puppet updates for the installer. However, the PXE flag needs to be shifted from the 1G to 10G port, which I've intentionally...
[00:28:04] RECOVERY - Check systemd state on an-tool1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:32:48] RECOVERY - Check systemd state on elastic2048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:53:44] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:27:26] (PS1) RLazarus: Cleanup: Replace all format() calls on string literals with f-strings. [software/spicerack] - https://gerrit.wikimedia.org/r/719389
[01:32:31] (CR) jerkins-bot: [V: -1] Cleanup: Replace all format() calls on string literals with f-strings. [software/spicerack] - https://gerrit.wikimedia.org/r/719389 (owner: RLazarus)
[01:48:18] PROBLEM - Check systemd state on lvs3005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:58:05] (CR) RLazarus: Cleanup: Replace all format() calls on string literals with f-strings. (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/719389 (owner: RLazarus)
[02:15:18] RECOVERY - Check systemd state on lvs3005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:25:12] PROBLEM - Check systemd state on doh4001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:52:04] RECOVERY - Check systemd state on doh4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:29:53] (PS1) Marostegui: db2094,db2095: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/719397 (https://phabricator.wikimedia.org/T288594)
[04:34:54] (CR) Marostegui: [C: +2] db2094,db2095: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/719397 (https://phabricator.wikimedia.org/T288594) (owner: Marostegui)
[04:54:11] (CR) Ryan Kemper: [C: +2] "This is awesome. The use of the rewrite rule to make this generalization seamless is brilliant" [puppet] - https://gerrit.wikimedia.org/r/714624 (https://phabricator.wikimedia.org/T280247) (owner: Ebernhardson)
[04:55:40] (CR) Ryan Kemper: "Sorry for the delay in getting this shipped. Thanks for the great work on this Zabe!" [puppet] - https://gerrit.wikimedia.org/r/716563 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[04:55:44] (CR) Ryan Kemper: [C: +2] query_service: remove absented query-service-gc-log-cleanup cron [puppet] - https://gerrit.wikimedia.org/r/716563 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[05:08:35] Puppet, Continuous-Integration-Infrastructure, Infrastructure-Foundations: Puppet failure on integration-puppetmaster-02.integration.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T290422 (hashar) Who knows? :] Thank you for the certificates regeneration!
[05:14:54] (PS4) Ryan Kemper: elasticsearch: begin pull out config [software/spicerack] - https://gerrit.wikimedia.org/r/716532
[05:20:18] (CR) jerkins-bot: [V: -1] elasticsearch: begin pull out config [software/spicerack] - https://gerrit.wikimedia.org/r/716532 (owner: Ryan Kemper)
[05:29:44] (PS5) Ryan Kemper: elasticsearch: begin pull out config [software/spicerack] - https://gerrit.wikimedia.org/r/716532
[05:34:47] (CR) jerkins-bot: [V: -1] elasticsearch: begin pull out config [software/spicerack] - https://gerrit.wikimedia.org/r/716532 (owner: Ryan Kemper)
[05:50:48] (PS1) Majavah: kubeadm: refresh version defaults for 1.19 [puppet] - https://gerrit.wikimedia.org/r/719400
[05:50:50] (PS1) Majavah: aptrepo: drop k8s 1.18 updates [puppet] - https://gerrit.wikimedia.org/r/719401
[05:50:52] (PS1) Majavah: aprepo: drop k8s 1.18 repo [puppet] - https://gerrit.wikimedia.org/r/719402
[05:51:14] (PS2) Majavah: aptrepo: drop k8s 1.18 repo [puppet] - https://gerrit.wikimedia.org/r/719402
[05:57:50] (PS1) Majavah: kubeadm::repo: use lsbdistcodename [puppet] - https://gerrit.wikimedia.org/r/719403
[06:15:45] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:16:07] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:27:58] (PS6) Ryan Kemper: elasticsearch: begin pull out config [software/spicerack] - https://gerrit.wikimedia.org/r/716532
[06:30:57] (CR) Filippo Giunchedi: [C: +2] alertmanager: Send email on resolve for wikidata team [puppet] - https://gerrit.wikimedia.org/r/719380 (owner: Ladsgroup)
[06:33:19] (CR) jerkins-bot: [V: -1] elasticsearch: begin pull out config [software/spicerack] - https://gerrit.wikimedia.org/r/716532 (owner: Ryan Kemper)
[06:38:25] (CR) Filippo Giunchedi: [C: +2] icinga: remove check_grafana_alert [puppet] - https://gerrit.wikimedia.org/r/719107 (https://phabricator.wikimedia.org/T281359) (owner: Filippo Giunchedi)
[06:40:42] (CR) Filippo Giunchedi: [C: +2] prometheus: add ThanosSidecarUploadFailure to prometheus/ops [puppet] - https://gerrit.wikimedia.org/r/719126 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi)
[06:40:49] (PS2) Filippo Giunchedi: prometheus: add ThanosSidecarUploadFailure to prometheus/ops [puppet] - https://gerrit.wikimedia.org/r/719126 (https://phabricator.wikimedia.org/T288726)
[06:40:56] (PS1) Ryan Kemper: Revert "query_service: support multiple variants of wdqs microsite" [puppet] - https://gerrit.wikimedia.org/r/719185
[06:43:17] (CR) Ryan Kemper: [C: +2] Revert "query_service: support multiple variants of wdqs microsite" [puppet] - https://gerrit.wikimedia.org/r/719185 (owner: Ryan Kemper)
[06:43:41] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:44:05] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:45:22] !log [WDQS] Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/719185 to rollback query.wikidata.org changes
[06:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:14] !log [WDQS] Manually running puppet-agent on `miscweb2002.codfw.wmnet,miscweb1002.eqiad.wmnet`
[06:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:06] (CR) Filippo Giunchedi: [C: +2] o11y: add udp receive errors for statsd [alerts] - https://gerrit.wikimedia.org/r/719123 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi)
[06:48:13] (CR) Filippo Giunchedi: [C: +2] statsd: remove statsd_udp_inbound_errors [puppet] - https://gerrit.wikimedia.org/r/719124 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi)
[06:49:59] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:50:03] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:51:18] SRE, SRE-swift-storage: Put ms-be10[64-67] in service - https://phabricator.wikimedia.org/T290546 (fgiunchedi)
[06:51:31] SRE, SRE-swift-storage, User-fgiunchedi: Put ms-be10[64-67] in service - https://phabricator.wikimedia.org/T290546 (fgiunchedi)
[06:54:15] SRE, Wikimedia-Mailing-lists, User-Ladsgroup: Create coolest-tool-academy mailing list for Coolest Tool Award - https://phabricator.wikimedia.org/T290511 (Aklapper) Thank you Ladsgroup! Links for my colleagues: * Administration: https://lists.wikimedia.org/postorius/lists/coolest-tool-academy.lists....
[06:57:47] (CR) Filippo Giunchedi: "I like the idea, however it seems a lot of (effectively) per-host metrics, what sort of insights are you looking for from the metrics? I'm" [puppet] - https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) (owner: Jbond)
[06:59:26] (CR) Muehlenhoff: puppetmaster: puppet prometheus reporting (3 comments) [puppet] - https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) (owner: Jbond)
[07:00:48] (PS7) Gehel: elasticsearch: begin pull out config [software/spicerack] - https://gerrit.wikimedia.org/r/716532 (owner: Ryan Kemper)
[07:02:38] PROBLEM - Check systemd state on doh2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:02:46] SRE-swift-storage: Swift users and their usage - https://phabricator.wikimedia.org/T264291 (fgiunchedi) Thank you for the note @jcrespo, definitely agreed object storage is far more suited for attachments. If OTRS has plans to migrate to moss then definitely we should be taking that into account for the next...
[07:05:51] (CR) jerkins-bot: [V: -1] elasticsearch: begin pull out config [software/spicerack] - https://gerrit.wikimedia.org/r/716532 (owner: Ryan Kemper)
[07:08:04] RECOVERY - Check systemd state on doh2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:15:54] PROBLEM - Check systemd state on doh2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:18:41] (CR) MMandere: [C: +2] varnish: Remove Vagrant test scripts [puppet] - https://gerrit.wikimedia.org/r/719236 (https://phabricator.wikimedia.org/T286639) (owner: MMandere)
[07:23:02] SRE, Traffic, Patch-For-Review, good first task: Move Varnish test infrastructure from Vagrant to Docker - https://phabricator.wikimedia.org/T286639 (MMandere) Open→Resolved
[07:27:33] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:45:25] !log start rollout of prometheus-rsyslog-exporter 0.0.0+git20201008-3 to eqsin/esams/ulsfo - T210137
[07:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:31] T210137: Handle unknown stats in rsyslog_exporter - https://phabricator.wikimedia.org/T210137
[07:49:51] RECOVERY - Check systemd state on doh2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:50:37] (CR) Elukey: [C: +1] "I checked the flow multiple times and it seems clear, this refactoring is very nice and allows a lot more flexibility to bootstrap system " [deployment-charts] - https://gerrit.wikimedia.org/r/719295 (owner: JMeybohm)
[07:51:57] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 233, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:52:31] (CR) Elukey: [C: +1] admin_ng/main: Create istio-system namespace (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/719296 (owner: JMeybohm)
[07:59:08] SRE, SRE-Access-Requests, Release-Engineering-Team (Radar): Requesting exec access to pods in 'ci' namespace staging kubernetes - https://phabricator.wikimedia.org/T290360 (hashar)
[08:02:18] (CR) Arturo Borrero Gonzalez: [C: +1] wmcs: enforce a minimum spicerack version [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/713812 (owner: David Caro)
[08:03:30] (CR) Arturo Borrero Gonzalez: [C: +2] kubeadm: refresh version defaults for 1.19 [puppet] - https://gerrit.wikimedia.org/r/719400 (owner: Majavah)
[08:04:33] (PS2) Arturo Borrero Gonzalez: kubeadm::repo: use lsbdistcodename [puppet] - https://gerrit.wikimedia.org/r/719403 (owner: Majavah)
[08:05:40] (CR) Arturo Borrero Gonzalez: [C: +2] kubeadm::repo: use lsbdistcodename [puppet] - https://gerrit.wikimedia.org/r/719403 (owner: Majavah)
[08:07:19] (PS1) JMeybohm: charts/secrets: chart type is not valid in apiVersion v1 [deployment-charts] - https://gerrit.wikimedia.org/r/719465 (https://phabricator.wikimedia.org/T289835)
[08:07:58] (PS2) JMeybohm: charts/secrets: chart type is not valid in apiVersion v1 [deployment-charts] - https://gerrit.wikimedia.org/r/719465 (https://phabricator.wikimedia.org/T289835)
[08:10:55] (PS2) Vgutierrez: haproxy: Allow using a custom systemd::service template [puppet] - https://gerrit.wikimedia.org/r/719282 (https://phabricator.wikimedia.org/T290005)
[08:11:11] (CR) JMeybohm: admin_ng/main: Create istio-system namespace (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/719296 (owner: JMeybohm)
[08:11:28] (CR) JMeybohm: [C: +2] charts/secrets: chart type is not valid in apiVersion v1 [deployment-charts] - https://gerrit.wikimedia.org/r/719465 (https://phabricator.wikimedia.org/T289835) (owner: JMeybohm)
[08:12:08] (CR) Vgutierrez: [C: +2] haproxy: Allow using a custom systemd::service template (1 comment) [puppet] - https://gerrit.wikimedia.org/r/719282 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[08:14:06] (Merged) jenkins-bot: charts/secrets: chart type is not valid in apiVersion v1 [deployment-charts] - https://gerrit.wikimedia.org/r/719465 (https://phabricator.wikimedia.org/T289835) (owner: JMeybohm)
[08:14:39] (CR) Elukey: "Thank youuuuu" [deployment-charts] - https://gerrit.wikimedia.org/r/719465 (https://phabricator.wikimedia.org/T289835) (owner: JMeybohm)
[08:15:34] (PS1) Hashar: contint: do not backup /srv/docker [puppet] - https://gerrit.wikimedia.org/r/719466 (https://phabricator.wikimedia.org/T290437)
[08:16:54] (CR) Hashar: "I am 99% sure we dont care about images/containers stored on contint1001/contint2001 under /srv/docker. If we really care about them, they" [puppet] - https://gerrit.wikimedia.org/r/719466 (https://phabricator.wikimedia.org/T290437) (owner: Hashar)
[08:25:39] (CR) Kormat: [C: +2] Revert "mariadb: Set core sections to unidir replication." [puppet] - https://gerrit.wikimedia.org/r/719168 (owner: Marostegui)
[08:30:43] (PS3) Elukey: Add revscoring-editquality as first ml-service to helmfile.d [deployment-charts] - https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791)
[08:33:21] (CR) Elukey: Add revscoring-editquality as first ml-service to helmfile.d (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) (owner: Elukey)
[08:53:20] (PS1) Filippo Giunchedi: POC: override Cumin batch sleep+size from command line [debs/debdeploy] - https://gerrit.wikimedia.org/r/719470
[08:54:26] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:01:35] (CR) Jelto: [C: +2] "lgtm, I like the refactoring. I've done a diff in staging-eqiad and the only change is renaming of the RoleBinding psp-privileged to allow" [deployment-charts] - https://gerrit.wikimedia.org/r/719295 (owner: JMeybohm)
[09:03:36] (PS5) Vgutierrez: haproxy: Basic TLS terminator based on HAProxy [puppet] - https://gerrit.wikimedia.org/r/715932 (https://phabricator.wikimedia.org/T290005)
[09:03:38] (PS4) Vgutierrez: haproxy: Allow configuring TLS options [puppet] - https://gerrit.wikimedia.org/r/716000 (https://phabricator.wikimedia.org/T290005)
[09:03:40] (PS6) Vgutierrez: haproxy: STEK support [puppet] - https://gerrit.wikimedia.org/r/716224 (https://phabricator.wikimedia.org/T290005)
[09:03:42] (PS1) Vgutierrez: cache::haproxy: Configure sslcert::ocsp [puppet] - https://gerrit.wikimedia.org/r/719471 (https://phabricator.wikimedia.org/T290005)
[09:09:04] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:09:12] !log start rollout of prometheus-rsyslog-exporter 0.0.0+git20201008-3 to eqiad - T210137
[09:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:18] T210137: Handle unknown stats in rsyslog_exporter - https://phabricator.wikimedia.org/T210137
[09:10:31] (PS11) Jbond: puppetmaster: puppet prometheus reporting [puppet] - https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585)
[09:12:09] (PS12) Jbond: puppetmaster: puppet prometheus reporting [puppet] - https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585)
[09:13:01] (CR) Jbond: [C: -1] "going to -1 this for now in favour of https://gerrit.wikimedia.org/r/c/operations/puppet/+/719368" [puppet] - https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) (owner: Jbond)
[09:13:10] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_cassandra-metrics-collector.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:13:34] PROBLEM - Check systemd state on maps1006 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_cassandra-metrics-collector.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:13:38] PROBLEM - Check systemd state on maps1005 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_cassandra-metrics-collector.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:13:44] PROBLEM - Check systemd state on maps1010 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_cassandra-metrics-collector.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:20:40] PROBLEM - Check systemd state on maps1008 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_cassandra-metrics-collector.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:24:02] (CR) Jbond: "LGTM" [debs/debdeploy] - https://gerrit.wikimedia.org/r/719470 (owner: Filippo Giunchedi)
[09:26:26] (CR) Kormat: [C: +1] "This looks great, thanks :)" [cookbooks] - https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: RLazarus)
[09:27:00] (CR) Volans: POC: override Cumin batch sleep+size from command line (1 comment) [debs/debdeploy] - https://gerrit.wikimedia.org/r/719470 (owner: Filippo Giunchedi)
[09:27:38] (PS1) Jbond: 0.1.2: prepare release [software/statograph] - https://gerrit.wikimedia.org/r/719473
[09:28:25] (CR) Jcrespo: [C: +1] "+1 on the syntax, cannot speak of the logic, but this would explain the large increase in metadata." [puppet] - https://gerrit.wikimedia.org/r/719466 (https://phabricator.wikimedia.org/T290437) (owner: Hashar)
[09:28:27] (PS2) Jbond: 0.1.2: prepare release [software/statograph] - https://gerrit.wikimedia.org/r/719473 (https://phabricator.wikimedia.org/T290425)
[09:28:56] (PS1) Vgutierrez: haproxy: Accept systemd unit content instead of template path [puppet] - https://gerrit.wikimedia.org/r/719474 (https://phabricator.wikimedia.org/T290005)
[09:29:25] (CR) jerkins-bot: [V: -1] haproxy: Accept systemd unit content instead of template path [puppet] - https://gerrit.wikimedia.org/r/719474 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[09:29:37] (PS2) Vgutierrez: haproxy: Accept systemd unit content instead of template path [puppet] - https://gerrit.wikimedia.org/r/719474 (https://phabricator.wikimedia.org/T290005)
[09:29:39] !log start rollout of prometheus-rsyslog-exporter 0.0.0+git20201008-3 to codfw - T210137
[09:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:43] T210137: Handle unknown stats in rsyslog_exporter - https://phabricator.wikimedia.org/T210137
[09:32:00] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31026/console" [puppet] - https://gerrit.wikimedia.org/r/719474 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[09:34:05] (CR) Jbond: [C: +2] 0.1.2: prepare release [software/statograph] - https://gerrit.wikimedia.org/r/719473 (https://phabricator.wikimedia.org/T290425) (owner: Jbond)
[09:34:07] (PS1) JMeybohm: toolhub: chart type is not valid in apiVersion v1 [deployment-charts] - https://gerrit.wikimedia.org/r/719475
[09:34:25] PROBLEM - Check systemd state on maps2010 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_cassandra-metrics-collector.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:34:58] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_cassandra-metrics-collector.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:34:58] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_cassandra-metrics-collector.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:35:02] PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_cassandra-metrics-collector.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:38:21] !log start rollout of prometheus-rsyslog-exporter 0.0.0+git20201008-3 to wikimedia.org - T210137
[09:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:25] T210137: Handle unknown stats in rsyslog_exporter - https://phabricator.wikimedia.org/T210137
[09:40:25] (PS2) Filippo Giunchedi: Override Cumin batch sleep+size from command line [debs/debdeploy] - https://gerrit.wikimedia.org/r/719470
[09:40:55] (CR) Filippo Giunchedi: Override Cumin batch sleep+size from command line (1 comment) [debs/debdeploy] - https://gerrit.wikimedia.org/r/719470 (owner: Filippo Giunchedi)
[09:42:18] (CR) Vgutierrez: [V: +1 C: +2] haproxy: Accept systemd unit content instead of template path [puppet] - https://gerrit.wikimedia.org/r/719474 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[09:46:04] (PS6) Vgutierrez: haproxy: Basic TLS terminator based on HAProxy [puppet] - https://gerrit.wikimedia.org/r/715932 (https://phabricator.wikimedia.org/T290005)
[09:46:06] (PS5) Vgutierrez: haproxy: Allow configuring TLS options [puppet] - https://gerrit.wikimedia.org/r/716000 (https://phabricator.wikimedia.org/T290005)
[09:46:08] (PS7) Vgutierrez: haproxy: STEK support [puppet] - https://gerrit.wikimedia.org/r/716224 (https://phabricator.wikimedia.org/T290005)
[09:46:10] (PS2) Vgutierrez: cache::haproxy: Configure sslcert::ocsp [puppet] - https://gerrit.wikimedia.org/r/719471 (https://phabricator.wikimedia.org/T290005)
[09:47:42] (PS2) Jbond: puppetmaster: drop log messages from logstash reporter [puppet] - https://gerrit.wikimedia.org/r/719368
[09:48:00] (PS2) Jbond: P:puppetmaster::common: Add back logstash support [puppet] - https://gerrit.wikimedia.org/r/719372
[09:53:27] (CR) Kormat: [C: -1] sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: RLazarus)
[09:55:07] (PS3) Jbond: P:puppetmaster::common: Add back logstash support [puppet] - https://gerrit.wikimedia.org/r/719372 (https://phabricator.wikimedia.org/T222826)
[09:57:03] (PS4) Jbond: P:puppetmaster::common: Add back logstash support [puppet] - https://gerrit.wikimedia.org/r/719372 (https://phabricator.wikimedia.org/T222826)
[09:57:56] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:00:20] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:02:13] (PS5) Jbond: P:puppetmaster::common: Add back logstash support [puppet] - https://gerrit.wikimedia.org/r/719372 (https://phabricator.wikimedia.org/T222826)
[10:03:12] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31029/console" [puppet] - https://gerrit.wikimedia.org/r/719372 (https://phabricator.wikimedia.org/T222826) (owner: Jbond)
[10:03:21] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab2001.wikimedia.org with reason: upgrade gitlab2001 to new version https://phabricator.wikmiedia.org/T289802
[10:03:23] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab2001.wikimedia.org with reason: upgrade gitlab2001 to new version https://phabricator.wikmiedia.org/T289802
[10:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:52] !log upgrade gitlab2001 to gitlab-ce=14.0.10-ce.0
[10:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:56] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:12:30] RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:12:42] RECOVERY - Check systemd state on maps1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:12:44] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:12:44] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:12:48] RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:13:48] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:14:00] RECOVERY - Check systemd state on maps2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:14:14] RECOVERY - Check systemd state on maps1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:14:16] RECOVERY - Check systemd state on maps1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:14:24] RECOVERY - Check systemd state on maps1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:16:08] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:19:42] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM, but I'm not sure how this could affect cloud use cases." [puppet] - https://gerrit.wikimedia.org/r/719293 (https://phabricator.wikimedia.org/T165885) (owner: Jbond)
[10:24:21] (PS1) Muehlenhoff: Add Hiera option to enable Ganeti 2.16 backport [puppet] - https://gerrit.wikimedia.org/r/719476 (https://phabricator.wikimedia.org/T284811)
[10:24:56] (CR) jerkins-bot: [V: -1] Add Hiera option to enable Ganeti 2.16 backport [puppet] - https://gerrit.wikimedia.org/r/719476 (https://phabricator.wikimedia.org/T284811) (owner: Muehlenhoff)
[10:25:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:31:04] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:31:21] (PS2) Muehlenhoff: Add Hiera option to enable Ganeti 2.16 backport [puppet] - https://gerrit.wikimedia.org/r/719476 (https://phabricator.wikimedia.org/T284811)
[10:31:30] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:31:58] (CR) jerkins-bot: [V: -1] Add Hiera option to enable Ganeti 2.16 backport [puppet] - https://gerrit.wikimedia.org/r/719476 (https://phabricator.wikimedia.org/T284811) (owner: Muehlenhoff)
[10:32:50] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:36:18] (PS3) Muehlenhoff: Add Hiera option to enable Ganeti 2.16 backport [puppet] - https://gerrit.wikimedia.org/r/719476 (https://phabricator.wikimedia.org/T284811)
[10:40:46] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/719476 (https://phabricator.wikimedia.org/T284811) (owner: Muehlenhoff)
[10:46:26] (CR) Volans: sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: RLazarus)
[10:52:28] (PS1) Vgutierrez: haproxy: Allow configuring timeouts [puppet] - https://gerrit.wikimedia.org/r/719479 (https://phabricator.wikimedia.org/T290005)
[10:55:11] (PS2) Vgutierrez: haproxy: Allow configuring timeouts [puppet] - https://gerrit.wikimedia.org/r/719479 (https://phabricator.wikimedia.org/T290005)
[10:57:07] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (Vgutierrez)
[10:57:37] (PS4) Muehlenhoff: Add Hiera option to enable Ganeti 2.16 backport [puppet] - https://gerrit.wikimedia.org/r/719476 (https://phabricator.wikimedia.org/T284811)
[11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210908T1100).
[11:00:05] No Gerrit patches in the queue for this window AFAICS.
[11:00:59] (CR) Muehlenhoff: [C: +1] "Looks good. I'll test this on cumin2002 and will merge if all is working fine." [debs/debdeploy] - https://gerrit.wikimedia.org/r/719470 (owner: Filippo Giunchedi)
[11:01:34] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/719476 (https://phabricator.wikimedia.org/T284811) (owner: Muehlenhoff)
[11:01:50] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1005.eqiad.wmnet
[11:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:59] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on maps1005.eqiad.wmnet with reason: Resyncing from master
[11:02:01] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on maps1005.eqiad.wmnet with reason: Resyncing from master
[11:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:05] !log upload statograph_0.1.2
[11:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:50] PROBLEM - puppet last run on sretest1001 is CRITICAL: CRITICAL: Puppet has been disabled for 605106 seconds, message: testing custom network fact - jbond, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:11:08] (CR) Alexandros Kosiaris: [C: +1] Add Hiera option to enable Ganeti 2.16 backport [puppet] - https://gerrit.wikimedia.org/r/719476 (https://phabricator.wikimedia.org/T284811) (owner: Muehlenhoff)
[11:11:23] will fix that ^^
[11:14:49] SRE, SRE-OnFire, observability, User-jbond: statograph_post service fail on alert hosts - https://phabricator.wikimedia.org/T290425 (jbond) Open→Resolved a:cmooney I have deployed @cmooney fix will resolve
[11:14:51] SRE, SRE-OnFire, observability, Patch-For-Review, User-jbond: Automated uploads of minimal & comprehensible timeseries metrics for statuspage display - https://phabricator.wikimedia.org/T285569 (jbond)
[11:16:44] RECOVERY - puppet last run on sretest1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:18:17] (CR) Jbond: [V: +1] "thanks" [puppet] - https://gerrit.wikimedia.org/r/719293 (https://phabricator.wikimedia.org/T165885) (owner: Jbond)
[11:23:59] (PS2) Bartosz Dziewoński: DiscussionTools: Make 'newtopictool' available to everyone on arwiki and cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/710076 (https://phabricator.wikimedia.org/T285724)
[11:32:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:32:54] (PS5) Muehlenhoff: Add Hiera option to enable Ganeti 2.16 backport [puppet] - https://gerrit.wikimedia.org/r/719476 (https://phabricator.wikimedia.org/T284811)
[11:34:22] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:34:22] RECOVERY -
Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:34:56] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/719476 (https://phabricator.wikimedia.org/T284811) (owner: 10Muehlenhoff) [11:35:17] 10SRE, 10SRE-swift-storage, 10Maps, 10Product-Infrastructure-Team-Backlog (Kanban): tegola-vector-tiles load testing and Swift throughput experiments - https://phabricator.wikimedia.org/T284440 (10Jgiannelos) [11:36:14] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:39:31] (03CR) 10Ladsgroup: "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/719380 (owner: 10Ladsgroup) [11:40:57] (03PS1) 10Ladsgroup: mailman: Remove absented file definitions [puppet] - 10https://gerrit.wikimedia.org/r/719484 (https://phabricator.wikimedia.org/T282303) [11:43:29] (03PS5) 10Hnowlan: maps: standardise the maps2.0 config in codfw, remove old nodes [puppet] - 10https://gerrit.wikimedia.org/r/702687 (https://phabricator.wikimedia.org/T269582) [11:44:24] (03Abandoned) 10Elukey: kubeflow-kfserving-inference: add Secret specs for Swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/715747 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [11:45:18] (03PS6) 10Muehlenhoff: Add Hiera option to enable Ganeti 2.16 backport [puppet] - 10https://gerrit.wikimedia.org/r/719476 (https://phabricator.wikimedia.org/T284811) [11:46:50] (03CR) 10Hnowlan: [C: 03+2] maps: standardise the maps2.0 config in codfw, remove old nodes [puppet] - 10https://gerrit.wikimedia.org/r/702687 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [11:48:03] (03CR) 10Elukey: [C: 03+1] sre.hosts.decommission: use asset tag if mgmt fails [cookbooks] - 10https://gerrit.wikimedia.org/r/719135 (owner: 
10Volans) [11:50:16] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31030/console" [puppet] - 10https://gerrit.wikimedia.org/r/719239 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey) [11:56:49] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/719476 (https://phabricator.wikimedia.org/T284811) (owner: 10Muehlenhoff) [11:57:11] !log installing curl security updates on stretch [11:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:30] Amir1: you're welcome re: send_resolved: yes, easy enough :) thank you for sending the patch ready to be merged [12:03:57] my pleasure. I got another feature request that I need to dig [12:04:17] "be able to see old alerts. So far I haven't where to extend the time range about what alerts to show, or see the metrics which feed into this system" [12:06:53] makes sense re: history, we have it in logstash [12:07:09] I'll add docs/links to the wikitech page [12:11:12] Thanks! [12:14:29] {{done}} [12:14:44] (03CR) 10Muehlenhoff: [C: 03+2] Add Hiera option to enable Ganeti 2.16 backport [puppet] - 10https://gerrit.wikimedia.org/r/719476 (https://phabricator.wikimedia.org/T284811) (owner: 10Muehlenhoff) [12:15:16] Amir1: please send tasks our (o11y) way for feature requests too and happy to discuss [12:15:55] Sure, so far that was it. 
I let you know if there are more [12:16:16] sweet [12:19:48] (03CR) 10Elukey: Add revscoring-editquality as first ml-service to helmfile.d (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [12:22:48] (03PS2) 10Muehlenhoff: profile::tlsproxy::instance: Default to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/708969 (https://phabricator.wikimedia.org/T164456) [12:27:45] (03PS3) 10Jbond: puppetmaster: drop log messages from logstash reporter [puppet] - 10https://gerrit.wikimedia.org/r/719368 [12:29:33] (03PS2) 10Volans: icinga: Add services_downtimed context manager [software/spicerack] - 10https://gerrit.wikimedia.org/r/719356 (owner: 10RLazarus) [12:30:01] (03PS3) 10Volans: icinga: Add services_downtimed context manager [software/spicerack] - 10https://gerrit.wikimedia.org/r/719356 (owner: 10RLazarus) [12:30:17] (03CR) 10Volans: "I've took the liberty to fix the 2 nits" [software/spicerack] - 10https://gerrit.wikimedia.org/r/719356 (owner: 10RLazarus) [12:30:23] (03PS4) 10Jbond: puppetmaster: drop log messages from logstash reporter [puppet] - 10https://gerrit.wikimedia.org/r/719368 [12:30:44] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/708969 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [12:31:15] 10SRE, 10Datacenter-Switchover, 10User-notice: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (10Marostegui) [12:31:17] (03CR) 10Jbond: "see https://phabricator.wikimedia.org/P17252 for an example of the json blob that would be sent" [puppet] - 10https://gerrit.wikimedia.org/r/719368 (owner: 10Jbond) [12:33:28] (03PS5) 10Jbond: puppetmaster: drop log messages from logstash reporter [puppet] - 10https://gerrit.wikimedia.org/r/719368 [12:36:28] PROBLEM - Check systemd state on maps2004 is CRITICAL: CRITICAL - degraded: The following units failed: 
wmf_auto_restart_cassandra-metrics-collector.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:37:31] (03PS6) 10Jbond: puppetmaster: drop log messages from logstash reporter [puppet] - 10https://gerrit.wikimedia.org/r/719368 (https://phabricator.wikimedia.org/T222826) [12:38:01] (03CR) 10Volans: [C: 03+2] icinga: Add services_downtimed context manager [software/spicerack] - 10https://gerrit.wikimedia.org/r/719356 (owner: 10RLazarus) [12:43:04] (03Merged) 10jenkins-bot: icinga: Add services_downtimed context manager [software/spicerack] - 10https://gerrit.wikimedia.org/r/719356 (owner: 10RLazarus) [12:45:07] !log gitlab: pausing all runners in preparation for upgrade to 14.0.10 (T289802) [12:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:14] T289802: GitLab major version upgrade: 14.x - https://phabricator.wikimedia.org/T289802 [12:49:34] (03PS1) 10Muehlenhoff: Update repository hook for Gitlab 14.1 [puppet] - 10https://gerrit.wikimedia.org/r/719512 (https://phabricator.wikimedia.org/T289802) [12:50:31] (03PS1) 10Jelto: aptrepo::files::updates Update repository hook for gitlab-runner 14 [puppet] - 10https://gerrit.wikimedia.org/r/719513 (https://phabricator.wikimedia.org/T289802) [12:52:46] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/719513 (https://phabricator.wikimedia.org/T289802) (owner: 10Jelto) [12:54:19] (03CR) 10Brennen Bearnes: "Should we just do the runner upgrade directly to final version, and keep them paused while we upgrade intermediary versions of GitLab itse" [puppet] - 10https://gerrit.wikimedia.org/r/719513 (https://phabricator.wikimedia.org/T289802) (owner: 10Jelto) [12:57:25] (03PS3) 10Ayounsi: JSON schema, add coverage to secrets [homer/public] - 10https://gerrit.wikimedia.org/r/674318 (https://phabricator.wikimedia.org/T272688) [12:59:13] (03CR) 10Jelto: aptrepo::files::updates Update repository hook for gitlab-runner 14 (031 comment) 
[puppet] - 10https://gerrit.wikimedia.org/r/719513 (https://phabricator.wikimedia.org/T289802) (owner: 10Jelto) [12:59:39] (03CR) 10Brennen Bearnes: aptrepo::files::updates Update repository hook for gitlab-runner 14 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719513 (https://phabricator.wikimedia.org/T289802) (owner: 10Jelto) [12:59:42] (03PS4) 10Elukey: Add revscoring-editquality as first ml-service to helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) [12:59:44] (03PS1) 10Elukey: kubeflow-kfserving-inference: avoid repetitions with multi-models [deployment-charts] - 10https://gerrit.wikimedia.org/r/719515 (https://phabricator.wikimedia.org/T286791) [13:01:23] (03CR) 10Brennen Bearnes: aptrepo::files::updates Update repository hook for gitlab-runner 14 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719513 (https://phabricator.wikimedia.org/T289802) (owner: 10Jelto) [13:03:18] (03PS2) 10Jelto: aptrepo::files::updates Update repository hook for gitlab-runner 14 [puppet] - 10https://gerrit.wikimedia.org/r/719513 (https://phabricator.wikimedia.org/T289802) [13:03:50] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2007.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:04:12] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1616 and 3116 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:04:24] (03CR) 10Filippo Giunchedi: "Thank you for the sample JSON, I'll let Cole comment on that but re: the mechanics it is sufficient to log json to local syslog (and allow" [puppet] - 10https://gerrit.wikimedia.org/r/719368 
(https://phabricator.wikimedia.org/T222826) (owner: 10Jbond) [13:06:18] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:07:42] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:08:16] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:12:10] (03CR) 10Elukey: Add revscoring-editquality as first ml-service to helmfile.d (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [13:13:17] !log gitlab1001: downtiming alerts for 2.5 hours; upgrading to 14.0.10 (T289802) [13:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:22] T289802: GitLab major version upgrade: 14.x - https://phabricator.wikimedia.org/T289802 [13:14:07] (03CR) 10Muehlenhoff: [C: 03+1] "Makes sense!" 
[puppet] - 10https://gerrit.wikimedia.org/r/719513 (https://phabricator.wikimedia.org/T289802) (owner: 10Jelto) [13:17:34] PROBLEM - Query Service HTTP Port on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [13:19:30] RECOVERY - Query Service HTTP Port on wdqs2002 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.018 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [13:22:08] (03CR) 10Jelto: [C: 03+1] Update repository hook for Gitlab 14.1 [puppet] - 10https://gerrit.wikimedia.org/r/719512 (https://phabricator.wikimedia.org/T289802) (owner: 10Muehlenhoff) [13:23:44] (03CR) 10Jelto: [C: 03+2] Update repository hook for Gitlab 14.1 [puppet] - 10https://gerrit.wikimedia.org/r/719512 (https://phabricator.wikimedia.org/T289802) (owner: 10Muehlenhoff) [13:24:10] (03CR) 10Jelto: [C: 03+2] aptrepo::files::updates Update repository hook for gitlab-runner 14 [puppet] - 10https://gerrit.wikimedia.org/r/719513 (https://phabricator.wikimedia.org/T289802) (owner: 10Jelto) [13:24:35] (03PS3) 10Jelto: aptrepo::files::updates Update repository hook for gitlab-runner 14 [puppet] - 10https://gerrit.wikimedia.org/r/719513 (https://phabricator.wikimedia.org/T289802) [13:26:46] (03CR) 10Vgutierrez: [C: 03+1] profile::tlsproxy::instance: Default to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/708969 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [13:29:25] (03PS1) 10Jbond: puppetdb: log "long" autovacuum tasks [puppet] - 10https://gerrit.wikimedia.org/r/719518 (https://phabricator.wikimedia.org/T263578) [13:30:09] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31031/console" [puppet] - 10https://gerrit.wikimedia.org/r/719518 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [13:33:18] (03CR) 10Jbond: [V: 03+1 C: 03+2] 
puppetdb: log "long" autovacuum tasks [puppet] - 10https://gerrit.wikimedia.org/r/719518 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [13:38:49] !log gitlab: upgrading gitlab2001, followed by gitlab1001, to 14.1.5 (T289802) [13:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:54] T289802: GitLab major version upgrade: 14.x - https://phabricator.wikimedia.org/T289802 [13:39:09] (03PS1) 10Jelto: aptrepo::files::updates Update repository hook for Gitlab 14.2 [puppet] - 10https://gerrit.wikimedia.org/r/719519 (https://phabricator.wikimedia.org/T289802) [13:40:18] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/719519 (https://phabricator.wikimedia.org/T289802) (owner: 10Jelto) [13:44:10] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@eb211ac]: kartotherian: restore v4 maxzoom to z15 [13:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:48:46] (03PS1) 10Elukey: Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791) [13:50:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:50:12] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) After doing the work to compact this again the database as a whole is most tables have a row count of either equal to the number of hosts... 
[13:50:52] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@eb211ac]: kartotherian: restore v4 maxzoom to z15 (duration: 06m 42s) [13:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:11] (03CR) 10jerkins-bot: [V: 04-1] Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [13:55:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:57:05] (03CR) 10Herron: [C: 03+1] rsyslog: stop saving trafficserver logs to disk [puppet] - 10https://gerrit.wikimedia.org/r/719052 (https://phabricator.wikimedia.org/T290305) (owner: 10Ema) [13:57:41] (03CR) 10Brennen Bearnes: [C: 03+1] aptrepo::files::updates Update repository hook for Gitlab 14.2 [puppet] - 10https://gerrit.wikimedia.org/r/719519 (https://phabricator.wikimedia.org/T289802) (owner: 10Jelto) [13:57:46] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:59:10] (03PS2) 10Elukey: Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791) [13:59:31] (03CR) 10Jelto: [C: 03+2] aptrepo::files::updates Update repository hook for Gitlab 14.2 [puppet] - 10https://gerrit.wikimedia.org/r/719519 (https://phabricator.wikimedia.org/T289802) (owner: 10Jelto) [14:01:02] (03CR) 10jerkins-bot: [V: 04-1] Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [14:02:27] (03PS1) 10MMandere: puppetmaster: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/719523 (https://phabricator.wikimedia.org/T282787) [14:04:04] (03PS2) 10Elukey: kubeflow-kfserving-inference: avoid repetitions with multi-models [deployment-charts] - 10https://gerrit.wikimedia.org/r/719515 (https://phabricator.wikimedia.org/T286791) [14:04:06] (03PS5) 10Elukey: Add revscoring-editquality as first ml-service to helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) [14:04:08] (03PS3) 10Elukey: Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791) [14:05:00] (03PS1) 10Ladsgroup: Turn off jQuery migrate on wikisource wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719524 (https://phabricator.wikimedia.org/T280944) [14:05:48] (03CR) 10jerkins-bot: [V: 04-1] Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [14:08:30] (03PS3) 10Elukey: kubeflow-kfserving-inference: avoid repetitions with multi-models 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/719515 (https://phabricator.wikimedia.org/T286791) [14:08:32] (03PS6) 10Elukey: Add revscoring-editquality as first ml-service to helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) [14:08:34] (03PS4) 10Elukey: Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791) [14:10:26] (03CR) 10jerkins-bot: [V: 04-1] Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [14:13:27] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/719523 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [14:15:04] (03CR) 10MMandere: [C: 03+2] puppetmaster: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/719523 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [14:17:54] (03CR) 10Filippo Giunchedi: profile: restart postgres on first install / bootstrap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/705704 (owner: 10Filippo Giunchedi) [14:25:05] (03PS1) 10MMandere: prometheus: Add drmrs DC Site [puppet] - 10https://gerrit.wikimedia.org/r/719526 (https://phabricator.wikimedia.org/T282787) [14:28:57] (03CR) 10Jbond: profile: restart postgres on first install / bootstrap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/705704 (owner: 10Filippo Giunchedi) [14:33:58] !log installing zeromq3 security updates [14:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:08] 10SRE, 10ops-eqiad, 10DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ms-be1067.eqiad.wmnet ` The log can be 
found in `/var/log/wmf-auto-reimage/2... [14:37:43] 10SRE, 10ops-codfw, 10DC-Ops: Netbox Errors in codfw - https://phabricator.wikimedia.org/T290362 (10Papaul) 05Open→03Resolved Complete [14:40:13] (03PS2) 10Herron: slo_dashboards: add cluster_label_query and set default [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/717587 (https://phabricator.wikimedia.org/T289036) [14:42:56] (03CR) 10Ahmon Dancy: [C: 03+1] contint: do not backup /srv/docker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719466 (https://phabricator.wikimedia.org/T290437) (owner: 10Hashar) [14:47:21] (03CR) 10Dzahn: [C: 03+1] "git clone from the new place works for me" [puppet] - 10https://gerrit.wikimedia.org/r/719363 (https://phabricator.wikimedia.org/T290259) (owner: 10Brennen Bearnes) [14:49:43] (03PS1) 10Alexandros Kosiaris: Update jabram's ssh key [puppet] - 10https://gerrit.wikimedia.org/r/719530 (https://phabricator.wikimedia.org/T209433) [14:49:52] 10SRE, 10SRE-Access-Requests: Replace JAbrams' old ssh public key with a new one - https://phabricator.wikimedia.org/T290433 (10akosiaris) p:05Triage→03Medium [14:51:20] 10SRE, 10ops-codfw, 10DC-Ops, 10decommission-hardware, 10Patch-For-Review: decommission pc2008.codfw.wmnet - https://phabricator.wikimedia.org/T289115 (10Papaul) [14:51:36] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1067.eqiad.wmnet with reason: REIMAGE [14:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:40] (03CR) 10Dzahn: [C: 03+2] contint: do not backup /srv/docker [puppet] - 10https://gerrit.wikimedia.org/r/719466 (https://phabricator.wikimedia.org/T290437) (owner: 10Hashar) [14:51:53] 10SRE, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission pc2009.codfw.wmnet - https://phabricator.wikimedia.org/T289116 (10Papaul) [14:51:57] 10SRE, 10ops-codfw, 10DC-Ops, 10decommission-hardware, 10Patch-For-Review: decommission pc2008.codfw.wmnet - https://phabricator.wikimedia.org/T289115 
(10Papaul) 05Open→03Resolved complete [14:52:25] 10SRE, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission pc2009.codfw.wmnet - https://phabricator.wikimedia.org/T289116 (10Papaul) 05Open→03Resolved complete [14:53:00] 10SRE, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission pc2010.codfw.wmnet - https://phabricator.wikimedia.org/T289117 (10Papaul) [14:53:42] mmandere: hi, we got a puppetmaster merge conflict, I don't mean to rush you though [14:53:46] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ms-be1067.eqiad.wmnet with reason: REIMAGE [14:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:08] 10SRE, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission pc2010.codfw.wmnet - https://phabricator.wikimedia.org/T289117 (10Papaul) 05Open→03Resolved complete [14:54:26] !log gitlab: upgrading gitlab2001, followed by gitlab1001, to 14.2.3 (T289802) [14:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:31] T289802: GitLab major version upgrade: 14.x - https://phabricator.wikimedia.org/T289802 [14:56:13] 10SRE, 10SRE-Access-Requests, 10Release-Engineering-Team (Radar): Requesting exec access to pods in 'ci' namespace staging kubernetes - https://phabricator.wikimedia.org/T290360 (10akosiaris) p:05Triage→03Medium Hi @dancy As in personal level access? We don't have user level accounts, so it would be so... 
[14:57:20] !log installing 4.19.194 kernels on stretch systems with 4.19.x (no reboots yet) [14:57:22] (03PS26) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (https://phabricator.wikimedia.org/T284079) [14:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:36] !log Retroactive: started to warm up eqiad databases [14:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:01] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Chmielko Maslak - https://phabricator.wikimedia.org/T289257 (10akosiaris) >>! In T289257#7327704, @fgiunchedi wrote: > @chmielkomaslak access has been set up, please confirm the following: >... [14:59:04] jbond: it's safe to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/719523 on the master, right? [14:59:19] "just" domain_search but on puppet masters [14:59:29] (03CR) 10Herron: "please see updated preview at https://grafana.wikimedia.org/dashboard/snapshot/yNhvI9nsW7T9O4d09qpnq0rCVr1IjrpQ" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/717587 (https://phabricator.wikimedia.org/T289036) (owner: 10Herron) [14:59:31] (03PS3) 10Herron: slo_dashboards: add cluster_label_query and set default [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/717587 (https://phabricator.wikimedia.org/T289036) [15:00:18] (03CR) 10jerkins-bot: [V: 04-1] sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (https://phabricator.wikimedia.org/T284079) (owner: 10Jbond) [15:01:10] 10SRE, 10SRE-Access-Requests, 10Release-Engineering-Team (Radar): Requesting exec access to pods in 'ci' namespace staging kubernetes - https://phabricator.wikimedia.org/T290360 (10dancy) >>! In T290360#7339450, @akosiaris wrote: > Hi @dancy > > As in personal level access? 
We don't have user level account... [15:01:39] 10SRE, 10ops-eqiad, 10DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be1067.eqiad.wmnet'] ` and were **ALL** successful. [15:02:36] 10SRE, 10ops-eqiad, 10DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10RobH) [15:02:49] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:02:55] 10SRE, 10ops-eqiad, 10DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10RobH) 05Open→03Resolved all hosts installed and staged [15:03:38] alright, I'll merge both changes [15:04:01] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:04:59] (03CR) 10Dzahn: [C: 03+1] "it's a wrapper for /usr/local/bin/safe-service-restart" [puppet] - 10https://gerrit.wikimedia.org/r/719307 (https://phabricator.wikimedia.org/T290038) (owner: 10Ahmon Dancy) [15:06:39] (03CR) 10Dzahn: "this wasn't merged yet on the puppetmaster, I did so just now" [puppet] - 10https://gerrit.wikimedia.org/r/719523 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [15:06:46] (03PS40) 10Btullis: Install Alluxio to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) [15:08:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={gitlab,rails,redis_gitlab,sidekiq} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:08:25] (03PS1) 10Filippo Giunchedi: swift: add ms-be10[64-67] [puppet] - 10https://gerrit.wikimedia.org/r/719532 (https://phabricator.wikimedia.org/T290546) [15:09:27] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: add ms-be10[64-67] [puppet] - 10https://gerrit.wikimedia.org/r/719532 (https://phabricator.wikimedia.org/T290546) (owner: 10Filippo Giunchedi) [15:10:44] (03CR) 10Dzahn: swift: convert dispersion stats cron to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [15:11:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:16:17] (03PS4) 10Dzahn: swift: convert dispersion stats cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673) [15:16:59] (03CR) 10Dzahn: swift: convert dispersion stats cron to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [15:17:11] 10SRE, 10SRE-Access-Requests: Requesting access to production shell for Mew Ophaswongse - https://phabricator.wikimedia.org/T290200 (10akosiaris) @DMburugu Hi! Your approval is required on this task. [15:19:00] (03CR) 10Dzahn: [C: 03+1] P:puppet: Add alerting for large files in client bucket [puppet] - 10https://gerrit.wikimedia.org/r/719293 (https://phabricator.wikimedia.org/T165885) (owner: 10Jbond) [15:19:17] 10SRE, 10SRE-Access-Requests, 10Analytics: Requesting access to analytics-privatedata-users group for Abban Dunne - https://phabricator.wikimedia.org/T289775 (10akosiaris) @odimitrijevic Hi! Your approval is required on this task. 
[15:21:07] (03PS2) 10Alexandros Kosiaris: Add the ability to generate comparisions of latency percentiles [software/benchmw] - 10https://gerrit.wikimedia.org/r/719105 [15:21:30] (03CR) 10Dzahn: [C: 03+1] "looks good to me, if there is any concern here it is really just that alert1001 does not get overloaded with too many checks" [puppet] - 10https://gerrit.wikimedia.org/r/719293 (https://phabricator.wikimedia.org/T165885) (owner: 10Jbond) [15:21:55] (03PS2) 10Alexandros Kosiaris: Update jabram's ssh key [puppet] - 10https://gerrit.wikimedia.org/r/719530 (https://phabricator.wikimedia.org/T209433) [15:23:12] 10SRE, 10SRE-swift-storage, 10User-fgiunchedi: Put ms-be10[64-67] in service - https://phabricator.wikimedia.org/T290546 (10akosiaris) p:05Triage→03Medium [15:24:20] (03CR) 10Dzahn: "I think you got the wrong ticket link?" [puppet] - 10https://gerrit.wikimedia.org/r/719530 (https://phabricator.wikimedia.org/T209433) (owner: 10Alexandros Kosiaris) [15:26:21] (03PS1) 10Alexandros Kosiaris: Remove user greta from admin/ [puppet] - 10https://gerrit.wikimedia.org/r/719536 (https://phabricator.wikimedia.org/T290423) [15:26:31] (03CR) 10Jbond: "updated" [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (https://phabricator.wikimedia.org/T284079) (owner: 10Jbond) [15:26:49] (03PS27) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (https://phabricator.wikimedia.org/T284079) [15:27:32] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Absent greta ldap account - https://phabricator.wikimedia.org/T290423 (10akosiaris) p:05Triage→03High Removed user from the nda group. I 'll merge the puppet change as well. 
[15:28:09] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove user greta from admin/ [puppet] - 10https://gerrit.wikimedia.org/r/719536 (https://phabricator.wikimedia.org/T290423) (owner: 10Alexandros Kosiaris) [15:29:16] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Absent greta ldap account - https://phabricator.wikimedia.org/T290423 (10akosiaris) 05Open→03Resolved a:03akosiaris Resolving. Change merged, user absent in puppet's admin/ module. [15:29:39] (03CR) 10Dzahn: "also see https://wikitech.wikimedia.org/wiki/SRE_Offboarding#Check_Users_LDAP_access and it might need to go into the special group for ab" [puppet] - 10https://gerrit.wikimedia.org/r/719536 (https://phabricator.wikimedia.org/T290423) (owner: 10Alexandros Kosiaris) [15:29:48] (03CR) 10jerkins-bot: [V: 04-1] sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (https://phabricator.wikimedia.org/T284079) (owner: 10Jbond) [15:31:17] (03CR) 10RLazarus: icinga: Add services_downtimed context manager (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/719356 (owner: 10RLazarus) [15:32:05] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-8), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (10akosiaris) @ArielGlenn Thanks for taking over this. Let us know if you need any help! 
[15:34:05] (03PS1) 10Ssingh: durum: switch to client-side UUID generation [puppet] - 10https://gerrit.wikimedia.org/r/719538 (https://phabricator.wikimedia.org/T289536) [15:35:08] (03PS3) 10JMeybohm: Rakefile: Add task validate_istio_config [deployment-charts] - 10https://gerrit.wikimedia.org/r/719272 [15:35:10] (03PS6) 10JMeybohm: custom_deploy: Add istio manifest for main clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/719265 (https://phabricator.wikimedia.org/T287007) [15:35:12] (03PS4) 10JMeybohm: admin_ng: Support managing of system namespaces with helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/719295 [15:35:14] (03PS5) 10JMeybohm: admin_ng/main: Create istio-system namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/719296 [15:35:16] (03PS1) 10JMeybohm: Rakefile: Remove check_docker. It's already in utils.rb [deployment-charts] - 10https://gerrit.wikimedia.org/r/719539 [15:35:18] (03PS1) 10JMeybohm: Rakefile: Add tasks to test and diff admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/719540 [15:37:31] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1064.eqiad.wmnet [15:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:48] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1066.eqiad.wmnet [15:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:55] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1065.eqiad.wmnet [15:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:04] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1067.eqiad.wmnet [15:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:08] 10SRE, 10serviceops: Pods in evicted state for various namespaces in k8s main - https://phabricator.wikimedia.org/T290444 (10akosiaris) For what is 
worth, evictions are not a bad thing per se in kubernetes. They can happen for a variety of reasons, notably: * `DiskPressure` -- Usable disk is running out on th... [15:39:10] (03CR) 10jerkins-bot: [V: 04-1] Rakefile: Add tasks to test and diff admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/719540 (owner: 10JMeybohm) [15:39:41] (03PS2) 10Ssingh: durum: switch to client-side UUID generation [puppet] - 10https://gerrit.wikimedia.org/r/719538 (https://phabricator.wikimedia.org/T289536) [15:40:55] (03CR) 10Jdlrobson: Italian Wikipedia is now a group 1 wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664) (owner: 10Jdlrobson) [15:41:21] (03CR) 10Elukey: "The change is very big :) I left some comments but overall it looks very good. I am a bit on the fence for the ssh keys parts, I have neve" [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [15:41:41] 10SRE, 10serviceops: Pods in evicted state for various namespaces in k8s main - https://phabricator.wikimedia.org/T290444 (10akosiaris) 05Open→03Resolved a:03akosiaris Per the above the answer to `Is it normal that pods are in this state? If not, let's investigate and then add an alarm :)` is "Mostly... 
[15:41:58] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1064.eqiad.wmnet [15:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:00] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1066.eqiad.wmnet [15:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:13] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1065.eqiad.wmnet [15:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:22] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1067.eqiad.wmnet [15:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:22] (03CR) 10Cwhite: "Looking good! I like how this prevents leaking sensitive data to logstash." [puppet] - 10https://gerrit.wikimedia.org/r/719368 (https://phabricator.wikimedia.org/T222826) (owner: 10Jbond) [15:46:14] (03CR) 10Elukey: "Question about dfs.permissions.superusergroup - is the alluxio user going" [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [15:47:59] (03PS1) 10Dzahn: thumbor: convert systemd-clean-tmpfiles cron to timer [puppet] - 10https://gerrit.wikimedia.org/r/719542 (https://phabricator.wikimedia.org/T273673) [15:51:39] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31035/console" [puppet] - 10https://gerrit.wikimedia.org/r/719538 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [15:52:19] (03CR) 10Elukey: Install Alluxio to the test cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [15:53:18] (03CR) 10Elukey: [C: 03+1] Rakefile: Remove check_docker. 
It's already in utils.rb [deployment-charts] - 10https://gerrit.wikimedia.org/r/719539 (owner: 10JMeybohm) [15:54:24] (03PS1) 10Dzahn: thumbor: convert generate-thumbor-age-metrics to timer [puppet] - 10https://gerrit.wikimedia.org/r/719543 (https://phabricator.wikimedia.org/T273673) [15:55:14] (03CR) 10Zabe: swift: convert dispersion stats cron to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [15:57:01] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto) Some additional RBAC requirements: on `releases1002` and `releases2002` helm is used as well. So when migrating, we have to make sure that the [user](https://gerrit.wikimedia.org... [16:00:09] (03PS2) 10Milimetric: analytics/data_purge: Finish renaming geoeditors_daily to editors_daily [puppet] - 10https://gerrit.wikimedia.org/r/716441 (https://phabricator.wikimedia.org/T290093) [16:00:31] (03CR) 10Milimetric: analytics/data_purge: Finish renaming geoeditors_daily to editors_daily (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/716441 (https://phabricator.wikimedia.org/T290093) (owner: 10Milimetric) [16:00:51] (03CR) 10jerkins-bot: [V: 04-1] analytics/data_purge: Finish renaming geoeditors_daily to editors_daily [puppet] - 10https://gerrit.wikimedia.org/r/716441 (https://phabricator.wikimedia.org/T290093) (owner: 10Milimetric) [16:01:01] PROBLEM - Check systemd state on cp5014 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:01:47] (03PS5) 10Dzahn: swift: convert dispersion stats cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673) [16:02:39] (03CR) 10Dzahn: swift: convert dispersion stats cron to systemd timer (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [16:04:20] (03PS2) 10Dzahn: thumbor: convert generate-thumbor-age-metrics to timer [puppet] - 10https://gerrit.wikimedia.org/r/719543 (https://phabricator.wikimedia.org/T273673) [16:07:50] (03PS3) 10Ssingh: durum: switch to client-side UUID generation [puppet] - 10https://gerrit.wikimedia.org/r/719538 (https://phabricator.wikimedia.org/T289536) [16:09:10] (03CR) 10Ladsgroup: swift: convert dispersion stats cron to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [16:09:21] (03CR) 10Ladsgroup: [C: 03+1] swift: convert dispersion stats cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [16:11:55] (03CR) 10Ladsgroup: thumbor: convert generate-thumbor-age-metrics to timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719543 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [16:13:36] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1005.eqiad.wmnet [16:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:48] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on maps1006.eqiad.wmnet with reason: Resyncing from master [16:13:50] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on maps1006.eqiad.wmnet with reason: Resyncing from master [16:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:57] (03CR) 10Dzahn: [C: 04-1] thumbor: convert generate-thumbor-age-metrics to timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719543 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [16:14:16] !log hnowlan@cumin1001 
START - Cookbook sre.hosts.downtime for 20:00:00 on maps1006.eqiad.wmnet with reason: Resyncing from master [16:14:18] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on maps1006.eqiad.wmnet with reason: Resyncing from master [16:14:18] (03CR) 10Elukey: "Saw the change passing by :)" [puppet] - 10https://gerrit.wikimedia.org/r/716441 (https://phabricator.wikimedia.org/T290093) (owner: 10Milimetric) [16:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:30] 10SRE, 10Datacenter-Switchover, 10User-notice: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (10fgiunchedi) Heads up, ATM swift traffic is in eqiad because of codfw hw rebalance (T288458). The eqiad swift hardware is ready to be put in service now, I'll be... [16:17:17] (03CR) 10Zabe: thumbor: convert systemd-clean-tmpfiles cron to timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719542 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [16:17:57] (03PS1) 10Alexandros Kosiaris: k8s: Instruct docker to keep logs at 100M [puppet] - 10https://gerrit.wikimedia.org/r/719550 (https://phabricator.wikimedia.org/T289578) [16:17:59] (03PS1) 10Alexandros Kosiaris: k8s: Instruct docker to keep logs at 100M, followup [puppet] - 10https://gerrit.wikimedia.org/r/719551 (https://phabricator.wikimedia.org/T289578) [16:20:39] jouncebot: nowandnext [16:20:39] No deployments scheduled for the next 1 hour(s) and 39 minute(s) [16:20:40] In 1 hour(s) and 39 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210908T1800) [16:20:40] In 1 hour(s) and 39 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210908T1800) [16:21:08] (03PS3) 10Alexandros Kosiaris: Update jabram's ssh key [puppet] - 
10https://gerrit.wikimedia.org/r/719530 (https://phabricator.wikimedia.org/T290433) [16:21:33] (03PS1) 10Urbanecm: updateMenteeData.php: Make it possible to force update [extensions/GrowthExperiments] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/719492 [16:22:01] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1006.eqiad.wmnet [16:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:10] 10SRE, 10ops-eqiad, 10DC-Ops: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 (10Cmjohnson) Sent the requested report to their tech support team. [16:23:05] RECOVERY - Check systemd state on cp5014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:23:44] !log hnowlan@cumin1001 START - Cookbook sre.hosts.decommission for hosts maps2001.codfw.wmnet [16:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:06] urbanecm: I quickly deploy something [16:25:15] 10SRE, 10Wikifeeds, 10serviceops, 10Patch-For-Review: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (10akosiaris) To everyone involved, should we have an incident doc about this? Given the amount of people involved and the amount of time that went... 
[16:25:26] 10SRE, 10Wikifeeds, 10serviceops, 10Patch-For-Review: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (10akosiaris) p:05Triage→03Low [16:25:28] (03CR) 10Ladsgroup: [C: 03+2] Turn off jQuery migrate on wikisource wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719524 (https://phabricator.wikimedia.org/T280944) (owner: 10Ladsgroup) [16:25:33] Amir1: go ahead [16:25:50] I'd like to deploy the backport, but I'm wondering if T290584 will get into my way :D [16:25:50] T290584: CI builds fail with "Module prefix 'pi' is shared between ProofreadPage\Api\ApiQueryProofreadInfo and PageImages\ApiQueryPageImages" - https://phabricator.wikimedia.org/T290584 [16:26:15] (03CR) 10Urbanecm: [C: 03+2] updateMenteeData.php: Make it possible to force update [extensions/GrowthExperiments] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/719492 (owner: 10Urbanecm) [16:26:21] i guess i +2 it and see what happens [16:26:31] mine will be quick, won't be on your way [16:26:38] sure, thanks [16:27:09] (03Merged) 10jenkins-bot: Turn off jQuery migrate on wikisource wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719524 (https://phabricator.wikimedia.org/T280944) (owner: 10Ladsgroup) [16:28:47] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:719524|Turn off jQuery migrate on wikisource wikis (T280944)]] (duration: 00m 59s) [16:28:51] PROBLEM - Check systemd state on cp5014 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:52] T280944: Phase out jQuery Migrate v3 - https://phabricator.wikimedia.org/T280944 [16:30:47] RECOVERY - Check systemd state on cp5014 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:30:52] I'm done [16:31:00] thanks [16:31:20] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1062 - https://phabricator.wikimedia.org/T290416 (10Cmjohnson) Disk ordered through Dell tech direct. You have successfully submitted request SR1069791974. [16:31:23] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "workarounding T290584, passed on master, trivial enough to forcemerge" [extensions/GrowthExperiments] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/719492 (owner: 10Urbanecm) [16:32:39] (03PS1) 10Legoktm: sre.switchdc.services: Temporarily exclude swift [cookbooks] - 10https://gerrit.wikimedia.org/r/719556 (https://phabricator.wikimedia.org/T287539) [16:33:56] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.21/extensions/GrowthExperiments/maintenance/updateMenteeData.php: 796e23c87ccfc48334ab932e13aab4f0ec746bbd: updateMenteeData.php: Make it possible to force update (duration: 00m 58s) [16:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:41] * urbanecm done [16:35:49] (03CR) 10Dzahn: "key looks like what is in the ticket, just not sure if we can know it's the right phab user and they have only this single ticket" [puppet] - 10https://gerrit.wikimedia.org/r/719530 (https://phabricator.wikimedia.org/T290433) (owner: 10Alexandros Kosiaris) [16:37:01] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts maps2001.codfw.wmnet [16:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:25] !log hnowlan@cumin1001 START - Cookbook sre.hosts.decommission for hosts maps2003.codfw.wmnet [16:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:15] 10SRE, 10DBA, 10observability, 10Datacenter-Switchover: Database read_only alert has a changing service description - https://phabricator.wikimedia.org/T290591 (10RLazarus) [16:39:57] (03CR) 10Dzahn: thumbor: convert 
systemd-clean-tmpfiles cron to timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719542 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [16:40:26] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1051 - https://phabricator.wikimedia.org/T290442 (10Cmjohnson) ticket with HPE opened Your case was successfully submitted. Please note your Case ID: 5358426636 for future reference. [16:40:43] (03PS2) 10Dzahn: thumbor: convert systemd-clean-tmpfiles cron to timer [puppet] - 10https://gerrit.wikimedia.org/r/719542 (https://phabricator.wikimedia.org/T273673) [16:41:10] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 (10Cmjohnson) The first thing we will need to do is to try and update the raid controller firmware along with the bios. [16:41:42] (03CR) 10JMeybohm: [C: 03+1] "Yay! Let's test this" [puppet] - 10https://gerrit.wikimedia.org/r/719550 (https://phabricator.wikimedia.org/T289578) (owner: 10Alexandros Kosiaris) [16:41:51] (03CR) 10RLazarus: sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [16:49:07] (03PS2) 10JMeybohm: Rakefile: Add tasks to test and diff admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/719540 [16:49:39] (03PS1) 10Hnowlan: kube_env: Give usage when no arguments are passed [puppet] - 10https://gerrit.wikimedia.org/r/719562 [16:52:58] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts maps2003.codfw.wmnet [16:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:20] !log hnowlan@cumin1001 START - Cookbook sre.hosts.decommission for hosts maps2004.codfw.wmnet [16:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:00] (03CR) 10jerkins-bot: [V: 04-1] 
Rakefile: Add tasks to test and diff admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/719540 (owner: 10JMeybohm) [16:57:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=redis_maps site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:01:57] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts maps2004.codfw.wmnet [17:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:36] (03PS1) 10Dave Pifke: pipeline: include php-excimer and php-redis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719565 (https://phabricator.wikimedia.org/T288165) [17:06:22] hnowlan: FYI if that helps, the decom cookbooks accepts any cumin query as selection, can do multiple [17:06:45] oh heh, would have been handy but I'm done now [17:06:46] thanks :) [17:07:08] just noticed :) [17:07:12] 10ops-codfw, 10decommission-hardware, 10Platform Team Workboards (Platform Engineering Reliability): decommission maps2001.codfw.wmnet, maps2002.codfw.wmnet, maps2003.codfw.wmnet, maps2004.codfw.wmnet - https://phabricator.wikimedia.org/T290588 (10hnowlan) [17:18:18] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:18:39] (03CR) 10Legoktm: [C: 04-1] "Minor inline stuff, otherwise LGTM :)" [puppet] - 10https://gerrit.wikimedia.org/r/719538 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [17:19:31] (03PS2) 10Dave Pifke: pipeline: include php-excimer extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719565 (https://phabricator.wikimedia.org/T288165) [17:23:44] (03CR) 10Legoktm: "Hm, it might also need adding to https://gerrit.wikimedia.org/r/plugins/gitiles/operations/docker-images/production-images/+/refs/heads/ma" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719565 (https://phabricator.wikimedia.org/T288165) (owner: 10Dave Pifke) [17:32:32] (03CR) 10Volans: "Thanks for the patch! LGTM, few comments/answers inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/719389 (owner: 10RLazarus) [17:34:38] (03CR) 10Volans: [C: 03+1] "At first look looks good to me, but I didn't check the consistency of all the fields added to the schema." 
[homer/public] - 10https://gerrit.wikimedia.org/r/674318 (https://phabricator.wikimedia.org/T272688) (owner: 10Ayounsi) [17:36:18] (03CR) 10Klausman: [C: 03+1] Add revscoring-editquality as first ml-service to helmfile.d (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [17:40:07] (03PS2) 10Daimona Eaytoy: Stop setting $wgAbuseFilterParserClass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680753 (https://phabricator.wikimedia.org/T239990) [17:42:11] (03PS4) 10Ssingh: durum: switch to client-side UUID generation [puppet] - 10https://gerrit.wikimedia.org/r/719538 (https://phabricator.wikimedia.org/T289536) [17:42:33] (03CR) 10Ssingh: durum: switch to client-side UUID generation (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/719538 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [17:44:10] (03PS3) 10Milimetric: analytics/data_purge: Finish renaming geoeditors_daily to editors_daily [puppet] - 10https://gerrit.wikimedia.org/r/716441 (https://phabricator.wikimedia.org/T290093) [17:44:47] (03CR) 10jerkins-bot: [V: 04-1] analytics/data_purge: Finish renaming geoeditors_daily to editors_daily [puppet] - 10https://gerrit.wikimedia.org/r/716441 (https://phabricator.wikimedia.org/T290093) (owner: 10Milimetric) [17:46:27] (03CR) 10Ahmon Dancy: [C: 04-1] "holding." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719565 (https://phabricator.wikimedia.org/T288165) (owner: 10Dave Pifke) [18:00:05] Deploy window Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210908T1800) [18:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210908T1800). [18:00:05] Daimona: A patch you scheduled for Morning backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:11] \o [18:00:13] o/ [18:00:17] I can deploy today [18:00:30] Daimona: do you wish to test this patch at a debug server? [18:00:56] (03CR) 10Urbanecm: [C: 03+2] Stop setting $wgAbuseFilterParserClass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680753 (https://phabricator.wikimedia.org/T239990) (owner: 10Daimona Eaytoy) [18:00:57] It shouldn't be necessary [18:00:58] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 (10wiki_willy) [18:00:59] okay [18:01:04] The config option was killed many months ago [18:01:14] yeah, i guess there's no way to reasonably test [18:01:17] i'll just sync then [18:01:23] Thanks [18:01:26] Daimona: thanks for fixing the GrowthExperiments CI issue btw [18:01:42] (03Merged) 10jenkins-bot: Stop setting $wgAbuseFilterParserClass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680753 (https://phabricator.wikimedia.org/T239990) (owner: 10Daimona Eaytoy) [18:01:50] No worries :) Fixing that stuff is fun [18:03:30] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 950a377e5ba6f5d318135e31b36334532d9ae71b: Stop setting $wgAbuseFilterParserClass (T239990) (duration: 00m 58s) [18:03:35] Daimona: here you go [18:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:36] T239990: Deprecate and then remove the old AbuseFilterParser - https://phabricator.wikimedia.org/T239990 [18:03:39] anything else? [18:03:48] Yay! Thank you! 
[18:03:59] any time :) [18:05:09] (03PS2) 10Urbanecm: Growth: Remove config that moved on-wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717039 (https://phabricator.wikimedia.org/T290295) [18:05:23] (03CR) 10Urbanecm: [C: 03+2] Growth: Remove config that moved on-wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717039 (https://phabricator.wikimedia.org/T290295) (owner: 10Urbanecm) [18:06:21] (03Merged) 10jenkins-bot: Growth: Remove config that moved on-wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717039 (https://phabricator.wikimedia.org/T290295) (owner: 10Urbanecm) [18:10:23] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: bbefce6a3778f159ad68587c830dff4a1da0c792: Growth: Remove config that moved on-wiki (T290295) (duration: 00m 58s) [18:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:28] T290295: Remove on-wiki Growth config from operations/mediawiki-config - https://phabricator.wikimedia.org/T290295 [18:10:33] * urbanecm done [18:21:03] urbanecm: sorry [18:21:07] had some issues getting in IRC today [18:21:13] did I miss the window? [18:21:28] Jdlrobson: hi! Well, it's still ongoing and i'm happy to deploy, but... [18:21:36] https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210908T1800 doesn't list any patches for your name? [18:21:38] i used the wrong window again didnt it... [18:21:57] a common problem with me [18:22:42] looks so [18:22:47] Moved it. [config] Italian Wikipedia is now a group 1 wiki {{gerrit|715571}} [18:23:07] (03PS6) 10Urbanecm: Italian Wikipedia is now a group 1 wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664) (owner: 10Jdlrobson) [18:23:09] let's do it then! 
[18:23:15] (03CR) 10Urbanecm: [C: 03+2] Italian Wikipedia is now a group 1 wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664) (owner: 10Jdlrobson) [18:23:37] sweet! [18:23:59] (03Merged) 10jenkins-bot: Italian Wikipedia is now a group 1 wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664) (owner: 10Jdlrobson) [18:24:06] i won't ask you for any tests this time, as effect is only visible when train rides 🙂 [18:24:15] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 1002 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:24:53] urbanecm: Roger that! [18:26:13] !log urbanecm@deploy1002 Synchronized dblists/: 6bcbe61f9a89086b775d84a81d55a7587cf26780: Italian Wikipedia is now a group 1 wiki (T286664; 1/2) (duration: 00m 58s) [18:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:18] T286664: Expand the list of group 1 wikis to contain at least one (preferably 2) smaller "top ten size" wikis - https://phabricator.wikimedia.org/T286664 [18:26:28] (03Abandoned) 10Ryan Kemper: wcqs: create wcqs microsite && move gui [puppet] - 10https://gerrit.wikimedia.org/r/657765 (https://phabricator.wikimedia.org/T271851) (owner: 10Ryan Kemper) [18:27:28] !log urbanecm@deploy1002 Synchronized wmf-config/config/itwiki.yaml: 6bcbe61f9a89086b775d84a81d55a7587cf26780: Italian Wikipedia is now a group 1 wiki (T286664; 2/2) (duration: 00m 58s) [18:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:41] Jdlrobson: should be live now! [18:27:53] urbanecm: sweet [18:28:07] urbanecm: I presume I should drop a note on the train task for next week. Is there anything else I should be doing? 
[18:28:20] Jdlrobson: yes, i was going to say that a note on the train task would be helpful [18:28:26] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: When WMF staff requests to be added to ldap/wmf, also add their Phabricator account to #WMF-NDA - https://phabricator.wikimedia.org/T290605 (10Aklapper) [18:28:41] Jdlrobson: maybe a technews notice? [18:29:05] Yep that should have gone out already. [18:29:28] excellent. so i think only the train note would be enough :). [18:31:18] urbanecm: thanks for your help here! [18:31:22] I'm excited to have a new group 1 wiki [18:31:26] me too! [18:31:47] Jdlrobson: is the NearbyPages extension working fine at beta, btw? :-) [18:33:14] Nice to see itwiki sorted [18:40:38] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: When WMF staff requests to be added to ldap/wmf, also add their Phabricator account to #WMF-NDA - https://phabricator.wikimedia.org/T290605 (10Aklapper) [18:42:48] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: When WMF staff requests to be added to ldap/wmf, also add their Phabricator account to #WMF-NDA - https://phabricator.wikimedia.org/T290605 (10Aklapper) [18:46:21] (03PS8) 10Ryan Kemper: elasticsearch: begin pull out config [software/spicerack] - 10https://gerrit.wikimedia.org/r/716532 [18:50:46] 10SRE, 10ops-codfw, 10Patch-For-Review: codfw: Ship back Raritan test PDU - https://phabricator.wikimedia.org/T287762 (10Papaul) Shipped out the CPI PDU today [18:55:50] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: begin pull out config [software/spicerack] - 10https://gerrit.wikimedia.org/r/716532 (owner: 10Ryan Kemper) [19:01:33] (03CR) 10Brennen Bearnes: [C: 03+1] "LGTM - feel free to merge & run if the puppetised version is good to go." 
[gitlab-ansible] - 10https://gerrit.wikimedia.org/r/719041 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [19:19:09] PROBLEM - MariaDB Replica Lag: s4 on db1150 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1135.21 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:19:57] PROBLEM - MariaDB Replica Lag: s4 on db2139 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1186.65 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:20:23] oh, that is the new s4 source backups, which doesn't have the alerts hidden [19:20:30] will fix that so it doesn't fire again [19:21:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:(Need By: TBD) rack/setup (4) fundraising hosts - https://phabricator.wikimedia.org/T289812 (10Jgreen) [19:25:12] !log krinkle@mw1369 Running some benchmarks in Eqiad on load.php [19:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:07] urbanecm: yep NearbyPages has been working fine. It's just waiting on performance review now.
[19:27:24] Jdlrobson: good luck with perf review then :) [19:47:13] (03CR) 10Jbond: "thanks cwhite for the chat over irc will address comments tomorrow" [puppet] - 10https://gerrit.wikimedia.org/r/719368 (https://phabricator.wikimedia.org/T222826) (owner: 10Jbond) [19:48:51] (03CR) 10Jbond: puppetmaster: drop log messages from logstash reporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719368 (https://phabricator.wikimedia.org/T222826) (owner: 10Jbond) [19:54:17] 10SRE, 10MW-on-K8s, 10serviceops, 10Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10Krinkle) [19:58:51] (03PS1) 10Krinkle: Fix label of rl_css url, improve other labels, add rl_startup url [software/benchmw] - 10https://gerrit.wikimedia.org/r/719608 (https://phabricator.wikimedia.org/T280497) [20:00:04] chrisalbon and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210908T2000). 
[20:00:24] (03PS2) 10Krinkle: Fix label of rl_css url, improve other labels, add rl_startup url [software/benchmw] - 10https://gerrit.wikimedia.org/r/719608 (https://phabricator.wikimedia.org/T280497) [20:10:08] 10SRE, 10ops-eqiad, 10decommission-hardware: Decommission asw-a-eqiad - https://phabricator.wikimedia.org/T218734 (10Jclark-ctr) [20:10:46] 10SRE, 10ops-eqiad, 10decommission-hardware: Decommission asw-a-eqiad - https://phabricator.wikimedia.org/T218734 (10Jclark-ctr) removed from racks and performed factory reset [20:10:56] 10SRE, 10ops-eqiad, 10decommission-hardware: Decommission asw-a-eqiad - https://phabricator.wikimedia.org/T218734 (10Jclark-ctr) 05Open→03Resolved [20:23:56] 10SRE, 10ops-eqiad, 10DC-Ops: Netbox Errors in eqiad - https://phabricator.wikimedia.org/T290364 (10Jclark-ctr) Fixed Netbox errors [20:24:05] 10SRE, 10ops-eqiad, 10DC-Ops: Netbox Errors in eqiad - https://phabricator.wikimedia.org/T290364 (10Jclark-ctr) 05Open→03Resolved [20:27:06] (03PS1) 10Dave Pifke: fpm-multiversion-base: add php-excimer extension [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/719609 (https://phabricator.wikimedia.org/T288165) [20:27:29] (03Abandoned) 10Dave Pifke: pipeline: include php-excimer extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719565 (https://phabricator.wikimedia.org/T288165) (owner: 10Dave Pifke) [20:33:00] Hey thcipriani: is https://gerrit.wikimedia.org/r/c/mediawiki/core/+/719500 blocking the train or is this a phpunit error?
[20:33:02] I need some context [20:33:48] hey Jdlrobson sorry, should have noted this was beta: https://en.wikipedia.beta.wmflabs.org/wiki/Special:Version [20:33:56] ^ revert not deployed yet [20:37:07] (03PS2) 10Legoktm: fpm-multiversion-base: add php-excimer extension [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/719609 (https://phabricator.wikimedia.org/T288165) (owner: 10Dave Pifke) [20:38:12] (03CR) 10Legoktm: [V: 03+2 C: 03+2] "PS2: Added a changelog entry" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/719609 (https://phabricator.wikimedia.org/T288165) (owner: 10Dave Pifke) [20:39:24] (03PS1) 10Dave Pifke: pipeline: add comment redirecting to correct file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719610 [20:40:03] (03CR) 10Cwhite: puppetmaster: drop log messages from logstash reporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719368 (https://phabricator.wikimedia.org/T222826) (owner: 10Jbond) [20:41:37] !log Successfully published image docker-registry.discovery.wmnet/php7.2-fpm-multiversion-base:1.0.2 [20:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:46] ^ Thanks!
[20:42:04] :) I think on the next pipeline run it should use the new image [21:04:25] (03PS1) 10Jdlrobson: Unset logo config rather than set to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 [21:08:11] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:09:23] (03CR) 10Krinkle: [C: 04-1] Unset logo config rather than set to false (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 (owner: 10Jdlrobson) [21:14:37] (03CR) 10Jdlrobson: Unset logo config rather than set to false (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 (owner: 10Jdlrobson) [21:16:11] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:17:14] (03PS1) 10Ebernhardson: Revert "Revert "query_service: support multiple variants of wdqs microsite"" [puppet] - 10https://gerrit.wikimedia.org/r/719502 [21:21:40] (03PS2) 10Ebernhardson: Revert "Revert "query_service: support multiple variants of wdqs microsite"" [puppet] - 10https://gerrit.wikimedia.org/r/719502 [21:22:30] (03PS3) 10Ebernhardson: Revert "Revert "query_service: support multiple variants of wdqs microsite"" [puppet] - 10https://gerrit.wikimedia.org/r/719502 (https://phabricator.wikimedia.org/T280247) [21:23:11] (03CR) 10Krinkle: [C: 04-1] Unset logo config rather than set to false (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 (owner: 10Jdlrobson) [21:37:31] PROBLEM - Check systemd state on ncredir3002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:43:01] RECOVERY - Check systemd state on ncredir3002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:45:59] (03CR) 10Jdlrobson: "Thanks for the help here Timo. Just need to verify from CI this does the right thing." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 (owner: 10Jdlrobson) [21:46:49] (03PS4) 10Ryan Kemper: query_service: support multiple variants of wdqs microsite [puppet] - 10https://gerrit.wikimedia.org/r/719502 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [21:48:27] PROBLEM - Check systemd state on ncredir3002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:51:02] (03PS2) 10Jdlrobson: Unset logo config rather than set to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 [21:51:29] (03CR) 10Ryan Kemper: [C: 03+2] query_service: support multiple variants of wdqs microsite [puppet] - 10https://gerrit.wikimedia.org/r/719502 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [21:52:07] RECOVERY - Check systemd state on ncredir3002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:53:11] !log [WDQS] T280247 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/719502 and ran puppet-agent on `miscweb*` [21:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:16] T280247: Create WCQS UI microsite deployment - https://phabricator.wikimedia.org/T280247 [21:55:07] !log [WDQS] T280247 Purged varnish to make sure change took effect: `echo 'https://query-preview.wikidata.org/' | mwscript purgeList.php` and `echo 'https://query.wikidata.org/' | mwscript purgeList.php` on `mwmaint1002` [21:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:35] PROBLEM - Check systemd state on ncredir3002 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:04:51] RECOVERY - Check systemd state on ncredir3002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:05:27] (03CR) 10Jdlrobson: [C: 04-1] "Override doesn't appear to be working so will likely need a better solution here." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 (owner: 10Jdlrobson) [22:08:51] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:09:07] (03PS3) 10Jdlrobson: Unset logo config rather than set to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 [22:14:03] PROBLEM - Check systemd state on ganeti5001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:24:07] !log [WDQS] T280247 Ran puppet-agent on `miscweb*` following merge of https://gerrit.wikimedia.org/r/c/wikidata/query/gui-deploy/+/714623 [22:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:12] T280247: Create WCQS UI microsite deployment - https://phabricator.wikimedia.org/T280247 [22:24:33] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:25:02] RECOVERY - Check systemd state on ganeti5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:30:31] PROBLEM - Check systemd state on ganeti5001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:34:09] RECOVERY - Check systemd state on ganeti5001 is
OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:34:32] !log [WDQS] T280247 Ran puppet-agent on `miscweb*` following merge of https://gerrit.wikimedia.org/r/c/wikidata/query/gui-deploy/+/717649 [22:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:38] T280247: Create WCQS UI microsite deployment - https://phabricator.wikimedia.org/T280247 [22:38:19] (03PS1) 10Ebernhardson: wcqs: Set admin groups and cluster to match wdqs [puppet] - 10https://gerrit.wikimedia.org/r/719643 [22:39:41] PROBLEM - Check systemd state on ganeti5001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:43:21] RECOVERY - Check systemd state on ganeti5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:57:35] PROBLEM - Check systemd state on ncredir3001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:00:05] RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Evening backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210908T2300). [23:00:05] No Gerrit patches in the queue for this window AFAICS.
[23:03:05] RECOVERY - Check systemd state on ncredir3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:08:37] PROBLEM - Check systemd state on ncredir3001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:14:07] RECOVERY - Check systemd state on ncredir3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:19:37] PROBLEM - Check systemd state on ncredir3001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:23:19] RECOVERY - Check systemd state on ncredir3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:25:17] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:38:42] 10SRE, 10DBA, 10observability, 10Datacenter-Switchover: Database read_only alert has a changing service description - https://phabricator.wikimedia.org/T290591 (10RLazarus) [23:39:41] (03CR) 10Legoktm: [C: 03+1] "LGTM, one suggestion inline (a follow-up would be fine)" [puppet] - 10https://gerrit.wikimedia.org/r/719538 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [23:39:44] 10SRE, 10DBA, 10observability, 10Datacenter-Switchover: Database read_only alert has a changing service description - https://phabricator.wikimedia.org/T290591 (10RLazarus) T277174 seems related too.