[00:00:00] <legoktm>	 !log legoktm@lists1001:~$ sudo rm -rf /etc/mailman # cleanup as part of 4869d91b0be / T282303
[00:00:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:00:06] <stashbot>	 T282303: The Great Clean Up of Mailman2  - https://phabricator.wikimedia.org/T282303
[00:00:43] <Amir1>	 I go back to doing chores, ping me if needed
[00:01:19] <legoktm>	 bye :) puppet is all happy now, thanks
[00:05:48] <icinga-wm>	 PROBLEM - Check systemd state on elastic2048 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:08:48] <icinga-wm>	 PROBLEM - Check systemd state on an-tool1005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:13:13] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (RobH)
[00:21:59] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (RobH) So I'm updating the firmware and I've applied puppet updates for the installer.  However, the PXE flag needs to be shifted from the 1G to 10G port, which I've intentionally...
[00:28:04] <icinga-wm>	 RECOVERY - Check systemd state on an-tool1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:32:48] <icinga-wm>	 RECOVERY - Check systemd state on elastic2048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:53:44] <icinga-wm>	 RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:27:26] <wikibugs>	 (PS1) RLazarus: Cleanup: Replace all format() calls on string literals with f-strings. [software/spicerack] - https://gerrit.wikimedia.org/r/719389
[01:32:31] <wikibugs>	 (CR) jerkins-bot: [V: -1] Cleanup: Replace all format() calls on string literals with f-strings. [software/spicerack] - https://gerrit.wikimedia.org/r/719389 (owner: RLazarus)
[01:48:18] <icinga-wm>	 PROBLEM - Check systemd state on lvs3005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:58:05] <wikibugs>	 (CR) RLazarus: Cleanup: Replace all format() calls on string literals with f-strings. (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/719389 (owner: RLazarus)
[02:15:18] <icinga-wm>	 RECOVERY - Check systemd state on lvs3005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:25:12] <icinga-wm>	 PROBLEM - Check systemd state on doh4001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:52:04] <icinga-wm>	 RECOVERY - Check systemd state on doh4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:29:53] <wikibugs>	 (PS1) Marostegui: db2094,db2095: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/719397 (https://phabricator.wikimedia.org/T288594)
[04:34:54] <wikibugs>	 (CR) Marostegui: [C: +2] db2094,db2095: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/719397 (https://phabricator.wikimedia.org/T288594) (owner: Marostegui)
[04:54:11] <wikibugs>	 (CR) Ryan Kemper: [C: +2] "This is awesome. The use of the rewrite rule to make this generalization seamless is brilliant" [puppet] - https://gerrit.wikimedia.org/r/714624 (https://phabricator.wikimedia.org/T280247) (owner: Ebernhardson)
[04:55:40] <wikibugs>	 (CR) Ryan Kemper: "Sorry for the delay in getting this shipped. Thanks for the great work on this Zabe!" [puppet] - https://gerrit.wikimedia.org/r/716563 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[04:55:44] <wikibugs>	 (CR) Ryan Kemper: [C: +2] query_service: remove absented query-service-gc-log-cleanup cron [puppet] - https://gerrit.wikimedia.org/r/716563 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[05:08:35] <wikibugs>	 Puppet, Continuous-Integration-Infrastructure, Infrastructure-Foundations: Puppet failure on integration-puppetmaster-02.integration.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T290422 (hashar) Who knows? :]  Thank you for the certificates regeneration!
[05:14:54] <wikibugs>	 (PS4) Ryan Kemper: elasticsearch: begin pull out config [software/spicerack] - https://gerrit.wikimedia.org/r/716532
[05:20:18] <wikibugs>	 (CR) jerkins-bot: [V: -1] elasticsearch: begin pull out config [software/spicerack] - https://gerrit.wikimedia.org/r/716532 (owner: Ryan Kemper)
[05:29:44] <wikibugs>	 (PS5) Ryan Kemper: elasticsearch: begin pull out config [software/spicerack] - https://gerrit.wikimedia.org/r/716532
[05:34:47] <wikibugs>	 (CR) jerkins-bot: [V: -1] elasticsearch: begin pull out config [software/spicerack] - https://gerrit.wikimedia.org/r/716532 (owner: Ryan Kemper)
[05:50:48] <wikibugs>	 (PS1) Majavah: kubeadm: refresh version defaults for 1.19 [puppet] - https://gerrit.wikimedia.org/r/719400
[05:50:50] <wikibugs>	 (PS1) Majavah: aptrepo: drop k8s 1.18 updates [puppet] - https://gerrit.wikimedia.org/r/719401
[05:50:52] <wikibugs>	 (PS1) Majavah: aprepo: drop k8s 1.18 repo [puppet] - https://gerrit.wikimedia.org/r/719402
[05:51:14] <wikibugs>	 (PS2) Majavah: aptrepo: drop k8s 1.18 repo [puppet] - https://gerrit.wikimedia.org/r/719402
[05:57:50] <wikibugs>	 (PS1) Majavah: kubeadm::repo: use lsbdistcodename [puppet] - https://gerrit.wikimedia.org/r/719403
[06:15:45] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:16:07] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:27:58] <wikibugs>	 (PS6) Ryan Kemper: elasticsearch: begin pull out config [software/spicerack] - https://gerrit.wikimedia.org/r/716532
[06:30:57] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] alertmanager: Send email on resolve for wikidata team [puppet] - https://gerrit.wikimedia.org/r/719380 (owner: Ladsgroup)
[06:33:19] <wikibugs>	 (CR) jerkins-bot: [V: -1] elasticsearch: begin pull out config [software/spicerack] - https://gerrit.wikimedia.org/r/716532 (owner: Ryan Kemper)
[06:38:25] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] icinga: remove check_grafana_alert [puppet] - https://gerrit.wikimedia.org/r/719107 (https://phabricator.wikimedia.org/T281359) (owner: Filippo Giunchedi)
[06:40:42] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: add ThanosSidecarUploadFailure to prometheus/ops [puppet] - https://gerrit.wikimedia.org/r/719126 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi)
[06:40:49] <wikibugs>	 (PS2) Filippo Giunchedi: prometheus: add ThanosSidecarUploadFailure to prometheus/ops [puppet] - https://gerrit.wikimedia.org/r/719126 (https://phabricator.wikimedia.org/T288726)
[06:40:56] <wikibugs>	 (PS1) Ryan Kemper: Revert "query_service: support multiple variants of wdqs microsite" [puppet] - https://gerrit.wikimedia.org/r/719185
[06:43:17] <wikibugs>	 (CR) Ryan Kemper: [C: +2] Revert "query_service: support multiple variants of wdqs microsite" [puppet] - https://gerrit.wikimedia.org/r/719185 (owner: Ryan Kemper)
[06:43:41] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:44:05] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:45:22] <ryankemper>	 !log [WDQS] Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/719185 to rollback query.wikidata.org changes
[06:45:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:14] <ryankemper>	 !log [WDQS] Manually running puppet-agent on `miscweb2002.codfw.wmnet,miscweb1002.eqiad.wmnet`
[06:46:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:06] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] o11y: add udp receive errors for statsd [alerts] - https://gerrit.wikimedia.org/r/719123 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi)
[06:48:13] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] statsd: remove statsd_udp_inbound_errors [puppet] - https://gerrit.wikimedia.org/r/719124 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi)
[06:49:59] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:50:03] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:51:18] <wikibugs>	 SRE, SRE-swift-storage: Put ms-be10[64-67] in service - https://phabricator.wikimedia.org/T290546 (fgiunchedi)
[06:51:31] <wikibugs>	 SRE, SRE-swift-storage, User-fgiunchedi: Put ms-be10[64-67] in service - https://phabricator.wikimedia.org/T290546 (fgiunchedi)
[06:54:15] <wikibugs>	 SRE, Wikimedia-Mailing-lists, User-Ladsgroup: Create coolest-tool-academy mailing list for Coolest Tool Award - https://phabricator.wikimedia.org/T290511 (Aklapper) Thank you Ladsgroup!  Links for my colleagues: * Administration: https://lists.wikimedia.org/postorius/lists/coolest-tool-academy.lists....
[06:57:47] <wikibugs>	 (CR) Filippo Giunchedi: "I like the idea, however it seems a lot of (effectively) per-host metrics, what sort of insights are you looking for from the metrics? I'm" [puppet] - https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) (owner: Jbond)
[06:59:26] <wikibugs>	 (CR) Muehlenhoff: puppetmaster: puppet prometheus reporting (3 comments) [puppet] - https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) (owner: Jbond)
[07:00:48] <wikibugs>	 (PS7) Gehel: elasticsearch: begin pull out config [software/spicerack] - https://gerrit.wikimedia.org/r/716532 (owner: Ryan Kemper)
[07:02:38] <icinga-wm>	 PROBLEM - Check systemd state on doh2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:02:46] <wikibugs>	 SRE-swift-storage: Swift users and their usage - https://phabricator.wikimedia.org/T264291 (fgiunchedi) Thank you for the note @jcrespo, definitely agreed object storage is far more suited for attachments. If OTRS has plans to migrate to moss then definitely we should be taking that into account for the next...
[07:05:51] <wikibugs>	 (CR) jerkins-bot: [V: -1] elasticsearch: begin pull out config [software/spicerack] - https://gerrit.wikimedia.org/r/716532 (owner: Ryan Kemper)
[07:08:04] <icinga-wm>	 RECOVERY - Check systemd state on doh2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:15:54] <icinga-wm>	 PROBLEM - Check systemd state on doh2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:18:41] <wikibugs>	 (CR) MMandere: [C: +2] varnish: Remove Vagrant test scripts [puppet] - https://gerrit.wikimedia.org/r/719236 (https://phabricator.wikimedia.org/T286639) (owner: MMandere)
[07:23:02] <wikibugs>	 SRE, Traffic, Patch-For-Review, good first task: Move Varnish test infrastructure from Vagrant to Docker - https://phabricator.wikimedia.org/T286639 (MMandere) Open→Resolved
[07:27:33] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:45:25] <godog>	 !log start rollout of prometheus-rsyslog-exporter 0.0.0+git20201008-3 to eqsin/esams/ulsfo - T210137
[07:45:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:31] <stashbot>	 T210137: Handle unknown stats in rsyslog_exporter - https://phabricator.wikimedia.org/T210137
[07:49:51] <icinga-wm>	 RECOVERY - Check systemd state on doh2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:50:37] <wikibugs>	 (CR) Elukey: [C: +1] "I checked the flow multiple times and it seems clear, this refactoring is very nice and allows a lot more flexibility to bootstrap system " [deployment-charts] - https://gerrit.wikimedia.org/r/719295 (owner: JMeybohm)
[07:51:57] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 233, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:52:31] <wikibugs>	 (CR) Elukey: [C: +1] admin_ng/main: Create istio-system namespace (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/719296 (owner: JMeybohm)
[07:59:08] <wikibugs>	 SRE, SRE-Access-Requests, Release-Engineering-Team (Radar): Requesting exec access to pods in 'ci' namespace staging kubernetes - https://phabricator.wikimedia.org/T290360 (hashar)
[08:02:18] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] wmcs: enforce a minimum spicerack version [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/713812 (owner: David Caro)
[08:03:30] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] kubeadm: refresh version defaults for 1.19 [puppet] - https://gerrit.wikimedia.org/r/719400 (owner: Majavah)
[08:04:33] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: kubeadm::repo: use lsbdistcodename [puppet] - https://gerrit.wikimedia.org/r/719403 (owner: Majavah)
[08:05:40] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] kubeadm::repo: use lsbdistcodename [puppet] - https://gerrit.wikimedia.org/r/719403 (owner: Majavah)
[08:07:19] <wikibugs>	 (PS1) JMeybohm: charts/secrets: chart type is not valid in apiVersion v1 [deployment-charts] - https://gerrit.wikimedia.org/r/719465 (https://phabricator.wikimedia.org/T289835)
[08:07:58] <wikibugs>	 (PS2) JMeybohm: charts/secrets: chart type is not valid in apiVersion v1 [deployment-charts] - https://gerrit.wikimedia.org/r/719465 (https://phabricator.wikimedia.org/T289835)
[08:10:55] <wikibugs>	 (PS2) Vgutierrez: haproxy: Allow using a custom systemd::service template [puppet] - https://gerrit.wikimedia.org/r/719282 (https://phabricator.wikimedia.org/T290005)
[08:11:11] <wikibugs>	 (CR) JMeybohm: admin_ng/main: Create istio-system namespace (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/719296 (owner: JMeybohm)
[08:11:28] <wikibugs>	 (CR) JMeybohm: [C: +2] charts/secrets: chart type is not valid in apiVersion v1 [deployment-charts] - https://gerrit.wikimedia.org/r/719465 (https://phabricator.wikimedia.org/T289835) (owner: JMeybohm)
[08:12:08] <wikibugs>	 (CR) Vgutierrez: [C: +2] haproxy: Allow using a custom systemd::service template (1 comment) [puppet] - https://gerrit.wikimedia.org/r/719282 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[08:14:06] <wikibugs>	 (Merged) jenkins-bot: charts/secrets: chart type is not valid in apiVersion v1 [deployment-charts] - https://gerrit.wikimedia.org/r/719465 (https://phabricator.wikimedia.org/T289835) (owner: JMeybohm)
[08:14:39] <wikibugs>	 (CR) Elukey: "Thank youuuuu" [deployment-charts] - https://gerrit.wikimedia.org/r/719465 (https://phabricator.wikimedia.org/T289835) (owner: JMeybohm)
[08:15:34] <wikibugs>	 (PS1) Hashar: contint: do not backup /srv/docker [puppet] - https://gerrit.wikimedia.org/r/719466 (https://phabricator.wikimedia.org/T290437)
[08:16:54] <wikibugs>	 (CR) Hashar: "I am 99% sure we dont care about images/containers stored on contint1001/contint2001 under /srv/docker. If we really care about them, they" [puppet] - https://gerrit.wikimedia.org/r/719466 (https://phabricator.wikimedia.org/T290437) (owner: Hashar)
[08:25:39] <wikibugs>	 (CR) Kormat: [C: +2] Revert "mariadb: Set core sections to unidir replication." [puppet] - https://gerrit.wikimedia.org/r/719168 (owner: Marostegui)
[08:30:43] <wikibugs>	 (PS3) Elukey: Add revscoring-editquality as first ml-service to helmfile.d [deployment-charts] - https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791)
[08:33:21] <wikibugs>	 (CR) Elukey: Add revscoring-editquality as first ml-service to helmfile.d (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) (owner: Elukey)
[08:53:20] <wikibugs>	 (PS1) Filippo Giunchedi: POC: override Cumin batch sleep+size from command line [debs/debdeploy] - https://gerrit.wikimedia.org/r/719470
[08:54:26] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:01:35] <wikibugs>	 (CR) Jelto: [C: +2] "lgtm, I like the refactoring. I've done a diff in staging-eqiad and the only change is renaming of the RoleBinding psp-privileged to allow" [deployment-charts] - https://gerrit.wikimedia.org/r/719295 (owner: JMeybohm)
[09:03:36] <wikibugs>	 (PS5) Vgutierrez: haproxy: Basic TLS terminator based on HAProxy [puppet] - https://gerrit.wikimedia.org/r/715932 (https://phabricator.wikimedia.org/T290005)
[09:03:38] <wikibugs>	 (PS4) Vgutierrez: haproxy: Allow configuring TLS options [puppet] - https://gerrit.wikimedia.org/r/716000 (https://phabricator.wikimedia.org/T290005)
[09:03:40] <wikibugs>	 (PS6) Vgutierrez: haproxy: STEK support [puppet] - https://gerrit.wikimedia.org/r/716224 (https://phabricator.wikimedia.org/T290005)
[09:03:42] <wikibugs>	 (PS1) Vgutierrez: cache::haproxy: Configure sslcert::ocsp [puppet] - https://gerrit.wikimedia.org/r/719471 (https://phabricator.wikimedia.org/T290005)
[09:09:04] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:09:12] <godog>	 !log start rollout of prometheus-rsyslog-exporter 0.0.0+git20201008-3 to eqiad - T210137
[09:09:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:18] <stashbot>	 T210137: Handle unknown stats in rsyslog_exporter - https://phabricator.wikimedia.org/T210137
[09:10:31] <wikibugs>	 (PS11) Jbond: puppetmaster: puppet prometheus reporting [puppet] - https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585)
[09:12:09] <wikibugs>	 (PS12) Jbond: puppetmaster: puppet prometheus reporting [puppet] - https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585)
[09:13:01] <wikibugs>	 (CR) Jbond: [C: -1] "going to -1 this for now in favour of https://gerrit.wikimedia.org/r/c/operations/puppet/+/719368" [puppet] - https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) (owner: Jbond)
[09:13:10] <icinga-wm>	 PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_cassandra-metrics-collector.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:13:34] <icinga-wm>	 PROBLEM - Check systemd state on maps1006 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_cassandra-metrics-collector.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:13:38] <icinga-wm>	 PROBLEM - Check systemd state on maps1005 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_cassandra-metrics-collector.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:13:44] <icinga-wm>	 PROBLEM - Check systemd state on maps1010 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_cassandra-metrics-collector.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:20:40] <icinga-wm>	 PROBLEM - Check systemd state on maps1008 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_cassandra-metrics-collector.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:24:02] <wikibugs>	 (CR) Jbond: "LGTM" [debs/debdeploy] - https://gerrit.wikimedia.org/r/719470 (owner: Filippo Giunchedi)
[09:26:26] <wikibugs>	 (CR) Kormat: [C: +1] "This looks great, thanks :)" [cookbooks] - https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: RLazarus)
[09:27:00] <wikibugs>	 (CR) Volans: POC: override Cumin batch sleep+size from command line (1 comment) [debs/debdeploy] - https://gerrit.wikimedia.org/r/719470 (owner: Filippo Giunchedi)
[09:27:38] <wikibugs>	 (PS1) Jbond: 0.1.2: prepare release [software/statograph] - https://gerrit.wikimedia.org/r/719473
[09:28:25] <wikibugs>	 (CR) Jcrespo: [C: +1] "+1 on the syntax, cannot speak of the logic, but this would explain the large increase in metadata." [puppet] - https://gerrit.wikimedia.org/r/719466 (https://phabricator.wikimedia.org/T290437) (owner: Hashar)
[09:28:27] <wikibugs>	 (PS2) Jbond: 0.1.2: prepare release [software/statograph] - https://gerrit.wikimedia.org/r/719473 (https://phabricator.wikimedia.org/T290425)
[09:28:56] <wikibugs>	 (PS1) Vgutierrez: haproxy: Accept systemd unit content instead of template path [puppet] - https://gerrit.wikimedia.org/r/719474 (https://phabricator.wikimedia.org/T290005)
[09:29:25] <wikibugs>	 (CR) jerkins-bot: [V: -1] haproxy: Accept systemd unit content instead of template path [puppet] - https://gerrit.wikimedia.org/r/719474 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[09:29:37] <wikibugs>	 (PS2) Vgutierrez: haproxy: Accept systemd unit content instead of template path [puppet] - https://gerrit.wikimedia.org/r/719474 (https://phabricator.wikimedia.org/T290005)
[09:29:39] <godog>	 !log start rollout of prometheus-rsyslog-exporter 0.0.0+git20201008-3 to codfw - T210137
[09:29:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:43] <stashbot>	 T210137: Handle unknown stats in rsyslog_exporter - https://phabricator.wikimedia.org/T210137
[09:32:00] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31026/console" [puppet] - https://gerrit.wikimedia.org/r/719474 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[09:34:05] <wikibugs>	 (CR) Jbond: [C: +2] 0.1.2: prepare release [software/statograph] - https://gerrit.wikimedia.org/r/719473 (https://phabricator.wikimedia.org/T290425) (owner: Jbond)
[09:34:07] <wikibugs>	 (PS1) JMeybohm: toolhub: chart type is not valid in apiVersion v1 [deployment-charts] - https://gerrit.wikimedia.org/r/719475
[09:34:25] <icinga-wm>	 PROBLEM - Check systemd state on maps2010 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_cassandra-metrics-collector.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:34:58] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_cassandra-metrics-collector.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:34:58] <icinga-wm>	 PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_cassandra-metrics-collector.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:35:02] <icinga-wm>	 PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_cassandra-metrics-collector.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:38:21] <godog>	 !log start rollout of prometheus-rsyslog-exporter 0.0.0+git20201008-3 to wikimedia.org - T210137
[09:38:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:25] <stashbot>	 T210137: Handle unknown stats in rsyslog_exporter - https://phabricator.wikimedia.org/T210137
[09:40:25] <wikibugs>	 (PS2) Filippo Giunchedi: Override Cumin batch sleep+size from command line [debs/debdeploy] - https://gerrit.wikimedia.org/r/719470
[09:40:55] <wikibugs>	 (CR) Filippo Giunchedi: Override Cumin batch sleep+size from command line (1 comment) [debs/debdeploy] - https://gerrit.wikimedia.org/r/719470 (owner: Filippo Giunchedi)
[09:42:18] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +2] haproxy: Accept systemd unit content instead of template path [puppet] - https://gerrit.wikimedia.org/r/719474 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[09:46:04] <wikibugs>	 (PS6) Vgutierrez: haproxy: Basic TLS terminator based on HAProxy [puppet] - https://gerrit.wikimedia.org/r/715932 (https://phabricator.wikimedia.org/T290005)
[09:46:06] <wikibugs>	 (PS5) Vgutierrez: haproxy: Allow configuring TLS options [puppet] - https://gerrit.wikimedia.org/r/716000 (https://phabricator.wikimedia.org/T290005)
[09:46:08] <wikibugs>	 (PS7) Vgutierrez: haproxy: STEK support [puppet] - https://gerrit.wikimedia.org/r/716224 (https://phabricator.wikimedia.org/T290005)
[09:46:10] <wikibugs>	 (PS2) Vgutierrez: cache::haproxy: Configure sslcert::ocsp [puppet] - https://gerrit.wikimedia.org/r/719471 (https://phabricator.wikimedia.org/T290005)
[09:47:42] <wikibugs>	 (PS2) Jbond: puppetmaster: drop log messages from logstash reporter [puppet] - https://gerrit.wikimedia.org/r/719368
[09:48:00] <wikibugs>	 (PS2) Jbond: P:puppetmaster::common: Add back logstash support [puppet] - https://gerrit.wikimedia.org/r/719372
[09:53:27] <wikibugs>	 (CR) Kormat: [C: -1] sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: RLazarus)
[09:55:07] <wikibugs>	 (PS3) Jbond: P:puppetmaster::common: Add back logstash support [puppet] - https://gerrit.wikimedia.org/r/719372 (https://phabricator.wikimedia.org/T222826)
[09:57:03] <wikibugs>	 (PS4) Jbond: P:puppetmaster::common: Add back logstash support [puppet] - https://gerrit.wikimedia.org/r/719372 (https://phabricator.wikimedia.org/T222826)
[09:57:56] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:00:20] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:02:13] <wikibugs>	 (PS5) Jbond: P:puppetmaster::common: Add back logstash support [puppet] - https://gerrit.wikimedia.org/r/719372 (https://phabricator.wikimedia.org/T222826)
[10:03:12] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31029/console" [puppet] - https://gerrit.wikimedia.org/r/719372 (https://phabricator.wikimedia.org/T222826) (owner: Jbond)
[10:03:21] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab2001.wikimedia.org with reason: upgrade gitlab2001 to new version https://phabricator.wikmiedia.org/T289802
[10:03:23] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab2001.wikimedia.org with reason: upgrade gitlab2001 to new version https://phabricator.wikmiedia.org/T289802
[10:03:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:52] <jelto>	 !log upgrade gitlab2001 to gitlab-ce=14.0.10-ce.0
[10:06:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:56] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:12:30] <icinga-wm>	 RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:12:42] <icinga-wm>	 RECOVERY - Check systemd state on maps1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:12:44] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:12:44] <icinga-wm>	 RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:12:48] <icinga-wm>	 RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:13:48] <icinga-wm>	 RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:14:00] <icinga-wm>	 RECOVERY - Check systemd state on maps2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:14:14] <icinga-wm>	 RECOVERY - Check systemd state on maps1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:14:16] <icinga-wm>	 RECOVERY - Check systemd state on maps1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:14:24] <icinga-wm>	 RECOVERY - Check systemd state on maps1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:16:08] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:19:42] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, but I'm not sure how this could affect cloud use cases." [puppet] - 10https://gerrit.wikimedia.org/r/719293 (https://phabricator.wikimedia.org/T165885) (owner: 10Jbond)
[10:24:21] <wikibugs>	 (03PS1) 10Muehlenhoff: Add Hiera option to enable Ganeti 2.16 backport [puppet] - 10https://gerrit.wikimedia.org/r/719476 (https://phabricator.wikimedia.org/T284811)
[10:24:56] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add Hiera option to enable Ganeti 2.16 backport [puppet] - 10https://gerrit.wikimedia.org/r/719476 (https://phabricator.wikimedia.org/T284811) (owner: 10Muehlenhoff)
[10:25:20] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:31:04] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:31:21] <wikibugs>	 (03PS2) 10Muehlenhoff: Add Hiera option to enable Ganeti 2.16 backport [puppet] - 10https://gerrit.wikimedia.org/r/719476 (https://phabricator.wikimedia.org/T284811)
[10:31:30] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:31:58] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add Hiera option to enable Ganeti 2.16 backport [puppet] - 10https://gerrit.wikimedia.org/r/719476 (https://phabricator.wikimedia.org/T284811) (owner: 10Muehlenhoff)
[10:32:50] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:36:18] <wikibugs>	 (03PS3) 10Muehlenhoff: Add Hiera option to enable Ganeti 2.16 backport [puppet] - 10https://gerrit.wikimedia.org/r/719476 (https://phabricator.wikimedia.org/T284811)
[10:40:46] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/719476 (https://phabricator.wikimedia.org/T284811) (owner: 10Muehlenhoff)
[10:46:26] <wikibugs>	 (03CR) 10Volans: sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus)
[10:52:28] <wikibugs>	 (03PS1) 10Vgutierrez: haproxy: Allow configuring timeouts [puppet] - 10https://gerrit.wikimedia.org/r/719479 (https://phabricator.wikimedia.org/T290005)
[10:55:11] <wikibugs>	 (03PS2) 10Vgutierrez: haproxy: Allow configuring timeouts [puppet] - 10https://gerrit.wikimedia.org/r/719479 (https://phabricator.wikimedia.org/T290005)
[10:57:07] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez)
[10:57:37] <wikibugs>	 (03PS4) 10Muehlenhoff: Add Hiera option to enable Ganeti 2.16 backport [puppet] - 10https://gerrit.wikimedia.org/r/719476 (https://phabricator.wikimedia.org/T284811)
[11:00:05] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210908T1100).
[11:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[11:00:59] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. I'll test this on cumin2002 and will merge if all is working fine." [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/719470 (owner: 10Filippo Giunchedi)
[11:01:34] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/719476 (https://phabricator.wikimedia.org/T284811) (owner: 10Muehlenhoff)
[11:01:50] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1005.eqiad.wmnet
[11:01:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:59] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on maps1005.eqiad.wmnet with reason: Resyncing from master
[11:02:01] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on maps1005.eqiad.wmnet with reason: Resyncing from master
[11:02:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:05] <jbond>	 !log upload statograph_0.1.2
[11:09:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:50] <icinga-wm>	 PROBLEM - puppet last run on sretest1001 is CRITICAL: CRITICAL: Puppet has been disabled for 605106 seconds, message: testing custom network fact - jbond, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:11:08] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Add Hiera option to enable Ganeti 2.16 backport [puppet] - 10https://gerrit.wikimedia.org/r/719476 (https://phabricator.wikimedia.org/T284811) (owner: 10Muehlenhoff)
[11:11:23] <jbond>	 will fix that ^^
[11:14:49] <wikibugs>	 10SRE, 10SRE-OnFire, 10observability, 10User-jbond: statograph_post service fail on alert hosts - https://phabricator.wikimedia.org/T290425 (10jbond) 05Open→03Resolved a:03cmooney I have deployed @cmooney fix will resolve
[11:14:51] <wikibugs>	 10SRE, 10SRE-OnFire, 10observability, 10Patch-For-Review, 10User-jbond: Automated uploads of minimal & comprehensible timeseries metrics for statuspage display - https://phabricator.wikimedia.org/T285569 (10jbond)
[11:16:44] <icinga-wm>	 RECOVERY - puppet last run on sretest1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:18:17] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/719293 (https://phabricator.wikimedia.org/T165885) (owner: 10Jbond)
[11:23:59] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: DiscussionTools: Make 'newtopictool' available to everyone on arwiki and cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710076 (https://phabricator.wikimedia.org/T285724)
[11:32:26] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:32:54] <wikibugs>	 (03PS5) 10Muehlenhoff: Add Hiera option to enable Ganeti 2.16 backport [puppet] - 10https://gerrit.wikimedia.org/r/719476 (https://phabricator.wikimedia.org/T284811)
[11:34:22] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:34:22] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:34:56] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/719476 (https://phabricator.wikimedia.org/T284811) (owner: 10Muehlenhoff)
[11:35:17] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Maps, 10Product-Infrastructure-Team-Backlog (Kanban): tegola-vector-tiles load testing and Swift throughput experiments - https://phabricator.wikimedia.org/T284440 (10Jgiannelos)
[11:36:14] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:39:31] <wikibugs>	 (03CR) 10Ladsgroup: "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/719380 (owner: 10Ladsgroup)
[11:40:57] <wikibugs>	 (03PS1) 10Ladsgroup: mailman: Remove absented file definitions [puppet] - 10https://gerrit.wikimedia.org/r/719484 (https://phabricator.wikimedia.org/T282303)
[11:43:29] <wikibugs>	 (03PS5) 10Hnowlan: maps: standardise the maps2.0 config in codfw, remove old nodes [puppet] - 10https://gerrit.wikimedia.org/r/702687 (https://phabricator.wikimedia.org/T269582)
[11:44:24] <wikibugs>	 (03Abandoned) 10Elukey: kubeflow-kfserving-inference: add Secret specs for Swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/715747 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey)
[11:45:18] <wikibugs>	 (03PS6) 10Muehlenhoff: Add Hiera option to enable Ganeti 2.16 backport [puppet] - 10https://gerrit.wikimedia.org/r/719476 (https://phabricator.wikimedia.org/T284811)
[11:46:50] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] maps: standardise the maps2.0 config in codfw, remove old nodes [puppet] - 10https://gerrit.wikimedia.org/r/702687 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan)
[11:48:03] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] sre.hosts.decommission: use asset tag if mgmt fails [cookbooks] - 10https://gerrit.wikimedia.org/r/719135 (owner: 10Volans)
[11:50:16] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31030/console" [puppet] - 10https://gerrit.wikimedia.org/r/719239 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey)
[11:56:49] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/719476 (https://phabricator.wikimedia.org/T284811) (owner: 10Muehlenhoff)
[11:57:11] <moritzm>	 !log installing curl security updates on stretch
[11:57:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:30] <godog>	 Amir1: you're welcome re: send_resolved: yes, easy enough :) thank you for sending the patch ready to be merged
[12:03:57] <Amir1>	 my pleasure. I got another feature request that I need to dig into
[12:04:17] <Amir1>	 "be able to see old alerts. So far I haven't found where to extend the time range about what alerts to show, or see the metrics which feed into this system"
[12:06:53] <godog>	 makes sense re: history, we have it in logstash
[12:07:09] <godog>	 I'll add docs/links to the wikitech page
[12:11:12] <Amir1>	 Thanks!
[12:14:29] <godog>	 {{done}}
[12:14:44] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add Hiera option to enable Ganeti 2.16 backport [puppet] - 10https://gerrit.wikimedia.org/r/719476 (https://phabricator.wikimedia.org/T284811) (owner: 10Muehlenhoff)
[12:15:16] <godog>	 Amir1: please send tasks our (o11y) way for feature requests too and happy to discuss
[12:15:55] <Amir1>	 Sure, so far that was it. I'll let you know if there are more
[12:16:16] <godog>	 sweet
[12:19:48] <wikibugs>	 (03CR) 10Elukey: Add revscoring-editquality as first ml-service to helmfile.d (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey)
[12:22:48] <wikibugs>	 (03PS2) 10Muehlenhoff: profile::tlsproxy::instance: Default to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/708969 (https://phabricator.wikimedia.org/T164456)
[12:27:45] <wikibugs>	 (03PS3) 10Jbond: puppetmaster: drop log messages from logstash reporter [puppet] - 10https://gerrit.wikimedia.org/r/719368
[12:29:33] <wikibugs>	 (03PS2) 10Volans: icinga: Add services_downtimed context manager [software/spicerack] - 10https://gerrit.wikimedia.org/r/719356 (owner: 10RLazarus)
[12:30:01] <wikibugs>	 (03PS3) 10Volans: icinga: Add services_downtimed context manager [software/spicerack] - 10https://gerrit.wikimedia.org/r/719356 (owner: 10RLazarus)
[12:30:17] <wikibugs>	 (03CR) 10Volans: "I've took the liberty to fix the 2 nits" [software/spicerack] - 10https://gerrit.wikimedia.org/r/719356 (owner: 10RLazarus)
[12:30:23] <wikibugs>	 (03PS4) 10Jbond: puppetmaster: drop log messages from logstash reporter [puppet] - 10https://gerrit.wikimedia.org/r/719368
[12:30:44] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/708969 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff)
[12:31:15] <wikibugs>	 10SRE, 10Datacenter-Switchover, 10User-notice: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (10Marostegui)
[12:31:17] <wikibugs>	 (03CR) 10Jbond: "see https://phabricator.wikimedia.org/P17252 for an example of the json blob that would be sent" [puppet] - 10https://gerrit.wikimedia.org/r/719368 (owner: 10Jbond)
[12:33:28] <wikibugs>	 (03PS5) 10Jbond: puppetmaster: drop log messages from logstash reporter [puppet] - 10https://gerrit.wikimedia.org/r/719368
[12:36:28] <icinga-wm>	 PROBLEM - Check systemd state on maps2004 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_cassandra-metrics-collector.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:37:31] <wikibugs>	 (03PS6) 10Jbond: puppetmaster: drop log messages from logstash reporter [puppet] - 10https://gerrit.wikimedia.org/r/719368 (https://phabricator.wikimedia.org/T222826)
[12:38:01] <wikibugs>	 (03CR) 10Volans: [C: 03+2] icinga: Add services_downtimed context manager [software/spicerack] - 10https://gerrit.wikimedia.org/r/719356 (owner: 10RLazarus)
[12:43:04] <wikibugs>	 (03Merged) 10jenkins-bot: icinga: Add services_downtimed context manager [software/spicerack] - 10https://gerrit.wikimedia.org/r/719356 (owner: 10RLazarus)
[12:45:07] <brennen>	 !log gitlab: pausing all runners in preparation for upgrade to 14.0.10 (T289802)
[12:45:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:14] <stashbot>	 T289802: GitLab major version upgrade: 14.x - https://phabricator.wikimedia.org/T289802
[12:49:34] <wikibugs>	 (03PS1) 10Muehlenhoff: Update repository hook for Gitlab 14.1 [puppet] - 10https://gerrit.wikimedia.org/r/719512 (https://phabricator.wikimedia.org/T289802)
[12:50:31] <wikibugs>	 (03PS1) 10Jelto: aptrepo::files::updates Update repository hook for gitlab-runner 14 [puppet] - 10https://gerrit.wikimedia.org/r/719513 (https://phabricator.wikimedia.org/T289802)
[12:52:46] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/719513 (https://phabricator.wikimedia.org/T289802) (owner: 10Jelto)
[12:54:19] <wikibugs>	 (03CR) 10Brennen Bearnes: "Should we just do the runner upgrade directly to final version, and keep them paused while we upgrade intermediary versions of GitLab itse" [puppet] - 10https://gerrit.wikimedia.org/r/719513 (https://phabricator.wikimedia.org/T289802) (owner: 10Jelto)
[12:57:25] <wikibugs>	 (03PS3) 10Ayounsi: JSON schema, add coverage to secrets [homer/public] - 10https://gerrit.wikimedia.org/r/674318 (https://phabricator.wikimedia.org/T272688)
[12:59:13] <wikibugs>	 (03CR) 10Jelto: aptrepo::files::updates Update repository hook for gitlab-runner 14 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719513 (https://phabricator.wikimedia.org/T289802) (owner: 10Jelto)
[12:59:39] <wikibugs>	 (03CR) 10Brennen Bearnes: aptrepo::files::updates Update repository hook for gitlab-runner 14 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719513 (https://phabricator.wikimedia.org/T289802) (owner: 10Jelto)
[12:59:42] <wikibugs>	 (03PS4) 10Elukey: Add revscoring-editquality as first ml-service to helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791)
[12:59:44] <wikibugs>	 (03PS1) 10Elukey: kubeflow-kfserving-inference: avoid repetitions with multi-models [deployment-charts] - 10https://gerrit.wikimedia.org/r/719515 (https://phabricator.wikimedia.org/T286791)
[13:01:23] <wikibugs>	 (03CR) 10Brennen Bearnes: aptrepo::files::updates Update repository hook for gitlab-runner 14 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719513 (https://phabricator.wikimedia.org/T289802) (owner: 10Jelto)
[13:03:18] <wikibugs>	 (03PS2) 10Jelto: aptrepo::files::updates Update repository hook for gitlab-runner 14 [puppet] - 10https://gerrit.wikimedia.org/r/719513 (https://phabricator.wikimedia.org/T289802)
[13:03:50] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2007.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:04:12] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1616 and 3116 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[13:04:24] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Thank you for the sample JSON, I'll let Cole comment on that but re: the mechanics it is sufficient to log json to local syslog (and allow" [puppet] - 10https://gerrit.wikimedia.org/r/719368 (https://phabricator.wikimedia.org/T222826) (owner: 10Jbond)
[13:06:18] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:07:42] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:08:16] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:12:10] <wikibugs>	 (03CR) 10Elukey: Add revscoring-editquality as first ml-service to helmfile.d (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey)
[13:13:17] <brennen>	 !log gitlab1001: downtiming alerts for 2.5 hours; upgrading to 14.0.10 (T289802)
[13:13:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:22] <stashbot>	 T289802: GitLab major version upgrade: 14.x - https://phabricator.wikimedia.org/T289802
[13:14:07] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Makes sense!" [puppet] - 10https://gerrit.wikimedia.org/r/719513 (https://phabricator.wikimedia.org/T289802) (owner: 10Jelto)
[13:17:34] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[13:19:30] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs2002 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.018 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[13:22:08] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] Update repository hook for Gitlab 14.1 [puppet] - 10https://gerrit.wikimedia.org/r/719512 (https://phabricator.wikimedia.org/T289802) (owner: 10Muehlenhoff)
[13:23:44] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] Update repository hook for Gitlab 14.1 [puppet] - 10https://gerrit.wikimedia.org/r/719512 (https://phabricator.wikimedia.org/T289802) (owner: 10Muehlenhoff)
[13:24:10] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] aptrepo::files::updates Update repository hook for gitlab-runner 14 [puppet] - 10https://gerrit.wikimedia.org/r/719513 (https://phabricator.wikimedia.org/T289802) (owner: 10Jelto)
[13:24:35] <wikibugs>	 (03PS3) 10Jelto: aptrepo::files::updates Update repository hook for gitlab-runner 14 [puppet] - 10https://gerrit.wikimedia.org/r/719513 (https://phabricator.wikimedia.org/T289802)
[13:26:46] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] profile::tlsproxy::instance: Default to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/708969 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff)
[13:29:25] <wikibugs>	 (03PS1) 10Jbond: puppetdb: log "long" autovacuum tasks [puppet] - 10https://gerrit.wikimedia.org/r/719518 (https://phabricator.wikimedia.org/T263578)
[13:30:09] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31031/console" [puppet] - 10https://gerrit.wikimedia.org/r/719518 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond)
[13:33:18] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetdb: log "long" autovacuum tasks [puppet] - 10https://gerrit.wikimedia.org/r/719518 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond)
[13:38:49] <brennen>	 !log gitlab: upgrading gitlab2001, followed by gitlab1001, to 14.1.5 (T289802)
[13:38:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:54] <stashbot>	 T289802: GitLab major version upgrade: 14.x - https://phabricator.wikimedia.org/T289802
[13:39:09] <wikibugs>	 (03PS1) 10Jelto: aptrepo::files::updates Update repository hook for Gitlab 14.2 [puppet] - 10https://gerrit.wikimedia.org/r/719519 (https://phabricator.wikimedia.org/T289802)
[13:40:18] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/719519 (https://phabricator.wikimedia.org/T289802) (owner: 10Jelto)
[13:44:10] <logmsgbot>	 !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@eb211ac]: kartotherian: restore v4 maxzoom to z15
[13:44:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:04] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:48:46] <wikibugs>	 (03PS1) 10Elukey: Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791)
[13:50:00] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:50:12] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) After doing the work to compact this again the database as a whole is most tables have a row count of either equal to the number of hosts...
[13:50:52] <logmsgbot>	 !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@eb211ac]: kartotherian: restore v4 maxzoom to z15 (duration: 06m 42s)
[13:50:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:11] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey)
[13:55:50] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:57:05] <wikibugs>	 (03CR) 10Herron: [C: 03+1] rsyslog: stop saving trafficserver logs to disk [puppet] - 10https://gerrit.wikimedia.org/r/719052 (https://phabricator.wikimedia.org/T290305) (owner: 10Ema)
[13:57:41] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+1] aptrepo::files::updates Update repository hook for Gitlab 14.2 [puppet] - 10https://gerrit.wikimedia.org/r/719519 (https://phabricator.wikimedia.org/T289802) (owner: 10Jelto)
[13:57:46] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:59:10] <wikibugs>	 (03PS2) 10Elukey: Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791)
[13:59:31] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] aptrepo::files::updates Update repository hook for Gitlab 14.2 [puppet] - 10https://gerrit.wikimedia.org/r/719519 (https://phabricator.wikimedia.org/T289802) (owner: 10Jelto)
[14:01:02] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey)
[14:02:27] <wikibugs>	 (03PS1) 10MMandere: puppetmaster: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/719523 (https://phabricator.wikimedia.org/T282787)
[14:04:04] <wikibugs>	 (03PS2) 10Elukey: kubeflow-kfserving-inference: avoid repetitions with multi-models [deployment-charts] - 10https://gerrit.wikimedia.org/r/719515 (https://phabricator.wikimedia.org/T286791)
[14:04:06] <wikibugs>	 (03PS5) 10Elukey: Add revscoring-editquality as first ml-service to helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791)
[14:04:08] <wikibugs>	 (03PS3) 10Elukey: Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791)
[14:05:00] <wikibugs>	 (03PS1) 10Ladsgroup: Turn off jQuery migrate on wikisource wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719524 (https://phabricator.wikimedia.org/T280944)
[14:05:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey)
[14:08:30] <wikibugs>	 (03PS3) 10Elukey: kubeflow-kfserving-inference: avoid repetitions with multi-models [deployment-charts] - 10https://gerrit.wikimedia.org/r/719515 (https://phabricator.wikimedia.org/T286791)
[14:08:32] <wikibugs>	 (03PS6) 10Elukey: Add revscoring-editquality as first ml-service to helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791)
[14:08:34] <wikibugs>	 (03PS4) 10Elukey: Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791)
[14:10:26] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey)
[14:13:27] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/719523 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere)
[14:15:04] <wikibugs>	 (03CR) 10MMandere: [C: 03+2] puppetmaster: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/719523 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere)
[14:17:54] <wikibugs>	 (03CR) 10Filippo Giunchedi: profile: restart postgres on first install / bootstrap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/705704 (owner: 10Filippo Giunchedi)
[14:25:05] <wikibugs>	 (03PS1) 10MMandere: prometheus: Add drmrs DC Site [puppet] - 10https://gerrit.wikimedia.org/r/719526 (https://phabricator.wikimedia.org/T282787)
[14:28:57] <wikibugs>	 (03CR) 10Jbond: profile: restart postgres on first install / bootstrap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/705704 (owner: 10Filippo Giunchedi)
[14:33:58] <moritzm>	 !log installing zeromq3 security updates
[14:34:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:08] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ms-be1067.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2...
[14:37:43] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops: Netbox Errors in codfw - https://phabricator.wikimedia.org/T290362 (10Papaul) 05Open→03Resolved Complete
[14:40:13] <wikibugs>	 (03PS2) 10Herron: slo_dashboards: add cluster_label_query and set default [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/717587 (https://phabricator.wikimedia.org/T289036)
[14:42:56] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] contint: do not backup /srv/docker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719466 (https://phabricator.wikimedia.org/T290437) (owner: 10Hashar)
[14:47:21] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "git clone from the new place works for me" [puppet] - 10https://gerrit.wikimedia.org/r/719363 (https://phabricator.wikimedia.org/T290259) (owner: 10Brennen Bearnes)
[14:49:43] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Update jabram's ssh key [puppet] - 10https://gerrit.wikimedia.org/r/719530 (https://phabricator.wikimedia.org/T209433)
[14:49:52] <wikibugs>	 10SRE, 10SRE-Access-Requests: Replace JAbrams' old ssh public key with a new one - https://phabricator.wikimedia.org/T290433 (10akosiaris) p:05Triage→03Medium
[14:51:20] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10decommission-hardware, 10Patch-For-Review: decommission pc2008.codfw.wmnet - https://phabricator.wikimedia.org/T289115 (10Papaul)
[14:51:36] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1067.eqiad.wmnet with reason: REIMAGE
[14:51:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:40] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] contint: do not backup /srv/docker [puppet] - 10https://gerrit.wikimedia.org/r/719466 (https://phabricator.wikimedia.org/T290437) (owner: 10Hashar)
[14:51:53] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission pc2009.codfw.wmnet - https://phabricator.wikimedia.org/T289116 (10Papaul)
[14:51:57] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10decommission-hardware, 10Patch-For-Review: decommission pc2008.codfw.wmnet - https://phabricator.wikimedia.org/T289115 (10Papaul) 05Open→03Resolved complete
[14:52:25] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission pc2009.codfw.wmnet - https://phabricator.wikimedia.org/T289116 (10Papaul) 05Open→03Resolved complete
[14:53:00] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission pc2010.codfw.wmnet - https://phabricator.wikimedia.org/T289117 (10Papaul)
[14:53:42] <mutante>	 mmandere: hi, we got a puppetmaster merge conflict, I don't mean to rush you though
[14:53:46] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ms-be1067.eqiad.wmnet with reason: REIMAGE
[14:53:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:08] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission pc2010.codfw.wmnet - https://phabricator.wikimedia.org/T289117 (10Papaul) 05Open→03Resolved complete
[14:54:26] <brennen>	 !log gitlab: upgrading gitlab2001, followed by gitlab1001, to 14.2.3 (T289802)
[14:54:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:31] <stashbot>	 T289802: GitLab major version upgrade: 14.x - https://phabricator.wikimedia.org/T289802
[14:56:13] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Release-Engineering-Team (Radar): Requesting exec access to pods in 'ci' namespace staging kubernetes - https://phabricator.wikimedia.org/T290360 (10akosiaris) p:05Triage→03Medium Hi @dancy   As in personal level access? We don't have user level accounts, so it would be so...
[14:57:20] <moritzm>	 !log installing 4.19.194 kernels on stretch systems with 4.19.x (no reboots yet)
[14:57:22] <wikibugs>	 (03PS26) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (https://phabricator.wikimedia.org/T284079)
[14:57:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:36] <marostegui>	 !log Retroactive: started to warm up eqiad databases
[14:57:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:01] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Chmielko Maslak - https://phabricator.wikimedia.org/T289257 (10akosiaris) >>! In T289257#7327704, @fgiunchedi wrote: > @chmielkomaslak  access has been set up, please confirm the following: >...
[14:59:04] <mutante>	 jbond: it's safe to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/719523 on the master, right?
[14:59:19] <mutante>	 "just" domain_search but on puppet masters
[14:59:29] <wikibugs>	 (03CR) 10Herron: "please see updated preview at https://grafana.wikimedia.org/dashboard/snapshot/yNhvI9nsW7T9O4d09qpnq0rCVr1IjrpQ" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/717587 (https://phabricator.wikimedia.org/T289036) (owner: 10Herron)
[14:59:31] <wikibugs>	 (03PS3) 10Herron: slo_dashboards: add cluster_label_query and set default [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/717587 (https://phabricator.wikimedia.org/T289036)
[15:00:18] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (https://phabricator.wikimedia.org/T284079) (owner: 10Jbond)
[15:01:10] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Release-Engineering-Team (Radar): Requesting exec access to pods in 'ci' namespace staging kubernetes - https://phabricator.wikimedia.org/T290360 (10dancy) >>! In T290360#7339450, @akosiaris wrote: > Hi @dancy  >  > As in personal level access? We don't have user level account...
[15:01:39] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be1067.eqiad.wmnet'] `  and were **ALL** successful.
[15:02:36] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10RobH)
[15:02:49] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[15:02:55] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10RobH) 05Open→03Resolved all hosts installed and staged
[15:03:38] <mutante>	 alright, I'll merge both changes
[15:04:01] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[15:04:59] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "it's a wrapper for /usr/local/bin/safe-service-restart" [puppet] - 10https://gerrit.wikimedia.org/r/719307 (https://phabricator.wikimedia.org/T290038) (owner: 10Ahmon Dancy)
[15:06:39] <wikibugs>	 (03CR) 10Dzahn: "this wasn't merged yet on the puppetmaster, I did so just now" [puppet] - 10https://gerrit.wikimedia.org/r/719523 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere)
[15:06:46] <wikibugs>	 (03PS40) 10Btullis: Install Alluxio to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641)
[15:08:15] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={gitlab,rails,redis_gitlab,sidekiq} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:08:25] <wikibugs>	 (03PS1) 10Filippo Giunchedi: swift: add ms-be10[64-67] [puppet] - 10https://gerrit.wikimedia.org/r/719532 (https://phabricator.wikimedia.org/T290546)
[15:09:27] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] swift: add ms-be10[64-67] [puppet] - 10https://gerrit.wikimedia.org/r/719532 (https://phabricator.wikimedia.org/T290546) (owner: 10Filippo Giunchedi)
[15:10:44] <wikibugs>	 (03CR) 10Dzahn: swift: convert dispersion stats cron to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[15:11:03] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:16:17] <wikibugs>	 (03PS4) 10Dzahn: swift: convert dispersion stats cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673)
[15:16:59] <wikibugs>	 (03CR) 10Dzahn: swift: convert dispersion stats cron to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[15:17:11] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to production shell for Mew Ophaswongse - https://phabricator.wikimedia.org/T290200 (10akosiaris) @DMburugu Hi! Your approval is required on this task.
[15:19:00] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] P:puppet: Add alerting for large files in client bucket [puppet] - 10https://gerrit.wikimedia.org/r/719293 (https://phabricator.wikimedia.org/T165885) (owner: 10Jbond)
[15:19:17] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Analytics: Requesting access to analytics-privatedata-users group for Abban Dunne - https://phabricator.wikimedia.org/T289775 (10akosiaris) @odimitrijevic Hi! Your approval is required on this task.
[15:21:07] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Add the ability to generate comparisions of latency percentiles [software/benchmw] - 10https://gerrit.wikimedia.org/r/719105
[15:21:30] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "looks good to me, if there is any concern here it is really just that alert1001 does not get overloaded with too many checks" [puppet] - 10https://gerrit.wikimedia.org/r/719293 (https://phabricator.wikimedia.org/T165885) (owner: 10Jbond)
[15:21:55] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Update jabram's ssh key [puppet] - 10https://gerrit.wikimedia.org/r/719530 (https://phabricator.wikimedia.org/T209433)
[15:23:12] <wikibugs>	 10SRE, 10SRE-swift-storage, 10User-fgiunchedi: Put ms-be10[64-67] in service - https://phabricator.wikimedia.org/T290546 (10akosiaris) p:05Triage→03Medium
[15:24:20] <wikibugs>	 (03CR) 10Dzahn: "I think you got the wrong ticket link?" [puppet] - 10https://gerrit.wikimedia.org/r/719530 (https://phabricator.wikimedia.org/T209433) (owner: 10Alexandros Kosiaris)
[15:26:21] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Remove user greta from admin/ [puppet] - 10https://gerrit.wikimedia.org/r/719536 (https://phabricator.wikimedia.org/T290423)
[15:26:31] <wikibugs>	 (03CR) 10Jbond: "updated" [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (https://phabricator.wikimedia.org/T284079) (owner: 10Jbond)
[15:26:49] <wikibugs>	 (03PS27) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (https://phabricator.wikimedia.org/T284079)
[15:27:32] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Absent greta ldap account - https://phabricator.wikimedia.org/T290423 (10akosiaris) p:05Triage→03High Removed user from the nda group. I 'll merge the puppet change as well.
[15:28:09] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove user greta from admin/ [puppet] - 10https://gerrit.wikimedia.org/r/719536 (https://phabricator.wikimedia.org/T290423) (owner: 10Alexandros Kosiaris)
[15:29:16] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Absent greta ldap account - https://phabricator.wikimedia.org/T290423 (10akosiaris) 05Open→03Resolved a:03akosiaris Resolving. Change merged, user absent in puppet's admin/ module.
[15:29:39] <wikibugs>	 (03CR) 10Dzahn: "also see https://wikitech.wikimedia.org/wiki/SRE_Offboarding#Check_Users_LDAP_access and it might need to go into the special group for ab" [puppet] - 10https://gerrit.wikimedia.org/r/719536 (https://phabricator.wikimedia.org/T290423) (owner: 10Alexandros Kosiaris)
[15:29:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (https://phabricator.wikimedia.org/T284079) (owner: 10Jbond)
[15:31:17] <wikibugs>	 (03CR) 10RLazarus: icinga: Add services_downtimed context manager (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/719356 (owner: 10RLazarus)
[15:32:05] <wikibugs>	 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-8), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (10akosiaris) @ArielGlenn Thanks for taking over this. Let us know if you need any help!
[15:34:05] <wikibugs>	 (03PS1) 10Ssingh: durum: switch to client-side UUID generation [puppet] - 10https://gerrit.wikimedia.org/r/719538 (https://phabricator.wikimedia.org/T289536)
[15:35:08] <wikibugs>	 (03PS3) 10JMeybohm: Rakefile: Add task validate_istio_config [deployment-charts] - 10https://gerrit.wikimedia.org/r/719272
[15:35:10] <wikibugs>	 (03PS6) 10JMeybohm: custom_deploy: Add istio manifest for main clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/719265 (https://phabricator.wikimedia.org/T287007)
[15:35:12] <wikibugs>	 (03PS4) 10JMeybohm: admin_ng: Support managing of system namespaces with helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/719295
[15:35:14] <wikibugs>	 (03PS5) 10JMeybohm: admin_ng/main: Create istio-system namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/719296
[15:35:16] <wikibugs>	 (03PS1) 10JMeybohm: Rakefile: Remove check_docker. It's already in utils.rb [deployment-charts] - 10https://gerrit.wikimedia.org/r/719539
[15:35:18] <wikibugs>	 (03PS1) 10JMeybohm: Rakefile: Add tasks to test and diff admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/719540
[15:37:31] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1064.eqiad.wmnet
[15:37:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:48] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1066.eqiad.wmnet
[15:37:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:55] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1065.eqiad.wmnet
[15:37:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:04] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1067.eqiad.wmnet
[15:38:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:08] <wikibugs>	 10SRE, 10serviceops: Pods in evicted state for various namespaces in k8s main - https://phabricator.wikimedia.org/T290444 (10akosiaris) For what is worth, evictions are not a bad thing per se in kubernetes. They can happen for a variety of reasons, notably:  * `DiskPressure` -- Usable disk is running out on th...
[15:39:10] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Rakefile: Add tasks to test and diff admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/719540 (owner: 10JMeybohm)
[15:39:41] <wikibugs>	 (03PS2) 10Ssingh: durum: switch to client-side UUID generation [puppet] - 10https://gerrit.wikimedia.org/r/719538 (https://phabricator.wikimedia.org/T289536)
[15:40:55] <wikibugs>	 (03CR) 10Jdlrobson: Italian Wikipedia is now a group 1 wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664) (owner: 10Jdlrobson)
[15:41:21] <wikibugs>	 (03CR) 10Elukey: "The change is very big :) I left some comments but overall it looks very good. I am a bit on the fence for the ssh keys parts, I have neve" [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis)
[15:41:41] <wikibugs>	 10SRE, 10serviceops: Pods in evicted state for various namespaces in k8s main - https://phabricator.wikimedia.org/T290444 (10akosiaris) 05Open→03Resolved a:03akosiaris Per the above the answer to   `Is it normal that pods are in this state? If not, let's investigate and then add an alarm :)`   is "Mostly...
[15:41:58] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1064.eqiad.wmnet
[15:42:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:00] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1066.eqiad.wmnet
[15:43:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:13] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1065.eqiad.wmnet
[15:43:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:22] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1067.eqiad.wmnet
[15:43:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:22] <wikibugs>	 (03CR) 10Cwhite: "Looking good!  I like how this prevents leaking sensitive data to logstash." [puppet] - 10https://gerrit.wikimedia.org/r/719368 (https://phabricator.wikimedia.org/T222826) (owner: 10Jbond)
[15:46:14] <wikibugs>	 (03CR) 10Elukey: "Question about dfs.permissions.superusergroup - is the alluxio user going" [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis)
[15:47:59] <wikibugs>	 (03PS1) 10Dzahn: thumbor: convert systemd-clean-tmpfiles cron to timer [puppet] - 10https://gerrit.wikimedia.org/r/719542 (https://phabricator.wikimedia.org/T273673)
[15:51:39] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31035/console" [puppet] - 10https://gerrit.wikimedia.org/r/719538 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh)
[15:52:19] <wikibugs>	 (03CR) 10Elukey: Install Alluxio to the test cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis)
[15:53:18] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Rakefile: Remove check_docker. It's already in utils.rb [deployment-charts] - 10https://gerrit.wikimedia.org/r/719539 (owner: 10JMeybohm)
[15:54:24] <wikibugs>	 (03PS1) 10Dzahn: thumbor: convert generate-thumbor-age-metrics to timer [puppet] - 10https://gerrit.wikimedia.org/r/719543 (https://phabricator.wikimedia.org/T273673)
[15:55:14] <wikibugs>	 (03CR) 10Zabe: swift: convert dispersion stats cron to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[15:57:01] <wikibugs>	 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto) Some additional RBAC requirements:  on `releases1002` and `releases2002` helm is used as well. So when migrating, we have to make sure that the [user](https://gerrit.wikimedia.org...
[16:00:09] <wikibugs>	 (03PS2) 10Milimetric: analytics/data_purge: Finish renaming geoeditors_daily to editors_daily [puppet] - 10https://gerrit.wikimedia.org/r/716441 (https://phabricator.wikimedia.org/T290093)
[16:00:31] <wikibugs>	 (03CR) 10Milimetric: analytics/data_purge: Finish renaming geoeditors_daily to editors_daily (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/716441 (https://phabricator.wikimedia.org/T290093) (owner: 10Milimetric)
[16:00:51] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] analytics/data_purge: Finish renaming geoeditors_daily to editors_daily [puppet] - 10https://gerrit.wikimedia.org/r/716441 (https://phabricator.wikimedia.org/T290093) (owner: 10Milimetric)
[16:01:01] <icinga-wm>	 PROBLEM - Check systemd state on cp5014 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:01:47] <wikibugs>	 (03PS5) 10Dzahn: swift: convert dispersion stats cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673)
[16:02:39] <wikibugs>	 (03CR) 10Dzahn: swift: convert dispersion stats cron to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[16:04:20] <wikibugs>	 (03PS2) 10Dzahn: thumbor: convert generate-thumbor-age-metrics to timer [puppet] - 10https://gerrit.wikimedia.org/r/719543 (https://phabricator.wikimedia.org/T273673)
[16:07:50] <wikibugs>	 (03PS3) 10Ssingh: durum: switch to client-side UUID generation [puppet] - 10https://gerrit.wikimedia.org/r/719538 (https://phabricator.wikimedia.org/T289536)
[16:09:10] <wikibugs>	 (03CR) 10Ladsgroup: swift: convert dispersion stats cron to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[16:09:21] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] swift: convert dispersion stats cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[16:11:55] <wikibugs>	 (03CR) 10Ladsgroup: thumbor: convert generate-thumbor-age-metrics to timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719543 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[16:13:36] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1005.eqiad.wmnet
[16:13:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:48] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on maps1006.eqiad.wmnet with reason: Resyncing from master
[16:13:50] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on maps1006.eqiad.wmnet with reason: Resyncing from master
[16:13:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:57] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] thumbor: convert generate-thumbor-age-metrics to timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719543 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[16:14:16] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on maps1006.eqiad.wmnet with reason: Resyncing from master
[16:14:18] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on maps1006.eqiad.wmnet with reason: Resyncing from master
[16:14:18] <wikibugs>	 (03CR) 10Elukey: "Saw the change passing by :)" [puppet] - 10https://gerrit.wikimedia.org/r/716441 (https://phabricator.wikimedia.org/T290093) (owner: 10Milimetric)
[16:14:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:30] <wikibugs>	 10SRE, 10Datacenter-Switchover, 10User-notice: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (10fgiunchedi) Heads up, ATM swift traffic is in eqiad because of codfw hw rebalance (T288458). The eqiad swift hardware is ready to be put in service now, I'll be...
[16:17:17] <wikibugs>	 (03CR) 10Zabe: thumbor: convert systemd-clean-tmpfiles cron to timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719542 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[16:17:57] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: k8s: Instruct docker to keep logs at 100M [puppet] - 10https://gerrit.wikimedia.org/r/719550 (https://phabricator.wikimedia.org/T289578)
[16:17:59] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: k8s: Instruct docker to keep logs at 100M, followup [puppet] - 10https://gerrit.wikimedia.org/r/719551 (https://phabricator.wikimedia.org/T289578)
[16:20:39] <urbanecm>	 jouncebot: nowandnext
[16:20:39] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 39 minute(s)
[16:20:40] <jouncebot>	 In 1 hour(s) and 39 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210908T1800)
[16:20:40] <jouncebot>	 In 1 hour(s) and 39 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210908T1800)
[16:21:08] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: Update jabram's ssh key [puppet] - 10https://gerrit.wikimedia.org/r/719530 (https://phabricator.wikimedia.org/T290433)
[16:21:33] <wikibugs>	 (03PS1) 10Urbanecm: updateMenteeData.php: Make it possible to force update [extensions/GrowthExperiments] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/719492
[16:22:01] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1006.eqiad.wmnet
[16:22:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:10] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 (10Cmjohnson) Sent the requested report to their tech support team.
[16:23:05] <icinga-wm>	 RECOVERY - Check systemd state on cp5014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:23:44] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.decommission for hosts maps2001.codfw.wmnet
[16:23:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:06] <Amir1>	 urbanecm: I'll quickly deploy something
[16:25:15] <wikibugs>	 10SRE, 10Wikifeeds, 10serviceops, 10Patch-For-Review: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (10akosiaris) To everyone involved, should we have an incident doc about this? Given the amount of people involved and the amount of time that went...
[16:25:26] <wikibugs>	 10SRE, 10Wikifeeds, 10serviceops, 10Patch-For-Review: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (10akosiaris) p:05Triage→03Low
[16:25:28] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Turn off jQuery migrate on wikisource wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719524 (https://phabricator.wikimedia.org/T280944) (owner: 10Ladsgroup)
[16:25:33] <urbanecm>	 Amir1: go ahead
[16:25:50] <urbanecm>	 I'd like to deploy the backport, but I'm wondering if T290584 will get in my way :D
[16:25:50] <stashbot>	 T290584: CI builds fail with "Module prefix 'pi' is shared between ProofreadPage\Api\ApiQueryProofreadInfo and PageImages\ApiQueryPageImages" - https://phabricator.wikimedia.org/T290584
[16:26:15] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] updateMenteeData.php: Make it possible to force update [extensions/GrowthExperiments] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/719492 (owner: 10Urbanecm)
[16:26:21] <urbanecm>	 i guess i +2 it and see what happens
[16:26:31] <Amir1>	 mine will be quick, won't be in your way
[16:26:38] <urbanecm>	 sure, thanks
[16:27:09] <wikibugs>	 (03Merged) 10jenkins-bot: Turn off jQuery migrate on wikisource wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719524 (https://phabricator.wikimedia.org/T280944) (owner: 10Ladsgroup)
[16:28:47] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:719524|Turn off jQuery migrate on wikisource wikis (T280944)]] (duration: 00m 59s)
[16:28:51] <icinga-wm>	 PROBLEM - Check systemd state on cp5014 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:28:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:52] <stashbot>	 T280944: Phase out jQuery Migrate v3 - https://phabricator.wikimedia.org/T280944
[16:30:47] <icinga-wm>	 RECOVERY - Check systemd state on cp5014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:30:52] <Amir1>	 I'm done
[16:31:00] <urbanecm>	 thanks
[16:31:20] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on ms-be1062 - https://phabricator.wikimedia.org/T290416 (10Cmjohnson) Disk ordered through Dell tech direct. You have successfully submitted request SR1069791974.
[16:31:23] <wikibugs>	 (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "workarounding T290584, passed on master, trivial enough to forcemerge" [extensions/GrowthExperiments] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/719492 (owner: 10Urbanecm)
[16:32:39] <wikibugs>	 (03PS1) 10Legoktm: sre.switchdc.services: Temporarily exclude swift [cookbooks] - 10https://gerrit.wikimedia.org/r/719556 (https://phabricator.wikimedia.org/T287539)
[16:33:56] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.21/extensions/GrowthExperiments/maintenance/updateMenteeData.php: 796e23c87ccfc48334ab932e13aab4f0ec746bbd: updateMenteeData.php: Make it possible to force update (duration: 00m 58s)
[16:33:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:41] * urbanecm done
[16:35:49] <wikibugs>	 (03CR) 10Dzahn: "key looks like what is in the ticket, just not sure if we can know it's the right phab user and they have only this single ticket" [puppet] - 10https://gerrit.wikimedia.org/r/719530 (https://phabricator.wikimedia.org/T290433) (owner: 10Alexandros Kosiaris)
[16:37:01] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts maps2001.codfw.wmnet
[16:37:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:25] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.decommission for hosts maps2003.codfw.wmnet
[16:37:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:15] <wikibugs>	 10SRE, 10DBA, 10observability, 10Datacenter-Switchover: Database read_only alert has a changing service description - https://phabricator.wikimedia.org/T290591 (10RLazarus)
[16:39:57] <wikibugs>	 (03CR) 10Dzahn: thumbor: convert systemd-clean-tmpfiles cron to timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719542 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[16:40:26] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on ms-be1051 - https://phabricator.wikimedia.org/T290442 (10Cmjohnson) ticket with HPE opened  Your case was successfully submitted. Please note your Case ID: 5358426636 for future reference.
[16:40:43] <wikibugs>	 (03PS2) 10Dzahn: thumbor: convert systemd-clean-tmpfiles cron to timer [puppet] - 10https://gerrit.wikimedia.org/r/719542 (https://phabricator.wikimedia.org/T273673)
[16:41:10] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 (10Cmjohnson) The first thing we will need to do  is to try and update the raid controller firmware along with the bios.
[16:41:42] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "Yay! Let's test this" [puppet] - 10https://gerrit.wikimedia.org/r/719550 (https://phabricator.wikimedia.org/T289578) (owner: 10Alexandros Kosiaris)
[16:41:51] <wikibugs>	 (03CR) 10RLazarus: sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus)
[16:49:07] <wikibugs>	 (03PS2) 10JMeybohm: Rakefile: Add tasks to test and diff admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/719540
[16:49:39] <wikibugs>	 (03PS1) 10Hnowlan: kube_env: Give usage when no arguments are passed [puppet] - 10https://gerrit.wikimedia.org/r/719562
[16:52:58] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts maps2003.codfw.wmnet
[16:53:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:20] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.decommission for hosts maps2004.codfw.wmnet
[16:53:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:00] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Rakefile: Add tasks to test and diff admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/719540 (owner: 10JMeybohm)
[16:57:40] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=redis_maps site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:01:57] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts maps2004.codfw.wmnet
[17:02:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:05:36] <wikibugs>	 (03PS1) 10Dave Pifke: pipeline: include php-excimer and php-redis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719565 (https://phabricator.wikimedia.org/T288165)
[17:06:22] <volans>	 hnowlan: FYI if that helps, the decom cookbook accepts any cumin query as selection, so it can do multiple hosts
[17:06:45] <hnowlan>	 oh heh, would have been handy but I'm done now 
[17:06:46] <hnowlan>	 thanks :) 
[17:07:08] <volans>	 just noticed :)
[17:07:12] <wikibugs>	 10ops-codfw, 10decommission-hardware, 10Platform Team Workboards (Platform Engineering Reliability): decommission maps2001.codfw.wmnet, maps2002.codfw.wmnet, maps2003.codfw.wmnet, maps2004.codfw.wmnet - https://phabricator.wikimedia.org/T290588 (10hnowlan)
[17:18:18] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:18:39] <wikibugs>	 (03CR) 10Legoktm: [C: 04-1] "Minor inline stuff, otherwise LGTM :)" [puppet] - 10https://gerrit.wikimedia.org/r/719538 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh)
[17:19:31] <wikibugs>	 (03PS2) 10Dave Pifke: pipeline: include php-excimer extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719565 (https://phabricator.wikimedia.org/T288165)
[17:23:44] <wikibugs>	 (03CR) 10Legoktm: "Hm, it might also need adding to https://gerrit.wikimedia.org/r/plugins/gitiles/operations/docker-images/production-images/+/refs/heads/ma" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719565 (https://phabricator.wikimedia.org/T288165) (owner: 10Dave Pifke)
[17:32:32] <wikibugs>	 (03CR) 10Volans: "Thanks for the patch! LGTM, few comments/answers inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/719389 (owner: 10RLazarus)
[17:34:38] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "At first look looks good to me, but I didn't check the consistency of all the fields added to the schema." [homer/public] - 10https://gerrit.wikimedia.org/r/674318 (https://phabricator.wikimedia.org/T272688) (owner: 10Ayounsi)
[17:36:18] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] Add revscoring-editquality as first ml-service to helmfile.d (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey)
[17:40:07] <wikibugs>	 (03PS2) 10Daimona Eaytoy: Stop setting $wgAbuseFilterParserClass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680753 (https://phabricator.wikimedia.org/T239990)
[17:42:11] <wikibugs>	 (03PS4) 10Ssingh: durum: switch to client-side UUID generation [puppet] - 10https://gerrit.wikimedia.org/r/719538 (https://phabricator.wikimedia.org/T289536)
[17:42:33] <wikibugs>	 (03CR) 10Ssingh: durum: switch to client-side UUID generation (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/719538 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh)
[17:44:10] <wikibugs>	 (03PS3) 10Milimetric: analytics/data_purge: Finish renaming geoeditors_daily to editors_daily [puppet] - 10https://gerrit.wikimedia.org/r/716441 (https://phabricator.wikimedia.org/T290093)
[17:44:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] analytics/data_purge: Finish renaming geoeditors_daily to editors_daily [puppet] - 10https://gerrit.wikimedia.org/r/716441 (https://phabricator.wikimedia.org/T290093) (owner: 10Milimetric)
[17:46:27] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 04-1] "holding." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719565 (https://phabricator.wikimedia.org/T288165) (owner: 10Dave Pifke)
[18:00:05] <jouncebot>	 Deploy window Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210908T1800)
[18:00:05] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210908T1800).
[18:00:05] <jouncebot>	 Daimona: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:00:11] <urbanecm>	 \o
[18:00:13] <Daimona>	 o/
[18:00:17] <urbanecm>	 I can deploy today
[18:00:30] <urbanecm>	 Daimona: do you wish to test this patch at a debug server?
[18:00:56] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Stop setting $wgAbuseFilterParserClass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680753 (https://phabricator.wikimedia.org/T239990) (owner: 10Daimona Eaytoy)
[18:00:57] <Daimona>	 It shouldn't be necessary
[18:00:58] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 (10wiki_willy)
[18:00:59] <urbanecm>	 okay
[18:01:04] <Daimona>	 The config option was killed many months ago
[18:01:14] <urbanecm>	 yeah, i guess there's no way to reasonably test
[18:01:17] <urbanecm>	 i'll just sync then
[18:01:23] <Daimona>	 Thanks
[18:01:26] <urbanecm>	 Daimona: thanks for fixing the GrowthExperiments CI issue btw
[18:01:42] <wikibugs>	 (03Merged) 10jenkins-bot: Stop setting $wgAbuseFilterParserClass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680753 (https://phabricator.wikimedia.org/T239990) (owner: 10Daimona Eaytoy)
[18:01:50] <Daimona>	 No worries :) Fixing that stuff is fun
[18:03:30] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 950a377e5ba6f5d318135e31b36334532d9ae71b: Stop setting $wgAbuseFilterParserClass (T239990) (duration: 00m 58s)
[18:03:35] <urbanecm>	 Daimona: here you go
[18:03:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:03:36] <stashbot>	 T239990: Deprecate and then remove the old AbuseFilterParser - https://phabricator.wikimedia.org/T239990
[18:03:39] <urbanecm>	 anything else?
[18:03:48] <Daimona>	 Yay! Thank you!
[18:03:59] <urbanecm>	 any time :)
[18:05:09] <wikibugs>	 (03PS2) 10Urbanecm: Growth: Remove config that moved on-wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717039 (https://phabricator.wikimedia.org/T290295)
[18:05:23] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Growth: Remove config that moved on-wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717039 (https://phabricator.wikimedia.org/T290295) (owner: 10Urbanecm)
[18:06:21] <wikibugs>	 (03Merged) 10jenkins-bot: Growth: Remove config that moved on-wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717039 (https://phabricator.wikimedia.org/T290295) (owner: 10Urbanecm)
[18:10:23] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: bbefce6a3778f159ad68587c830dff4a1da0c792: Growth: Remove config that moved on-wiki (T290295) (duration: 00m 58s)
[18:10:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:10:28] <stashbot>	 T290295: Remove on-wiki Growth config from operations/mediawiki-config - https://phabricator.wikimedia.org/T290295
[18:10:33] * urbanecm done
[18:21:03] <Jdlrobson>	 urbanecm: sorry
[18:21:07] <Jdlrobson>	 had some issues getting in IRC today
[18:21:13] <Jdlrobson>	 did I miss the window?
[18:21:28] <urbanecm>	 Jdlrobson: hi! Well, it's still ongoing and i'm happy to deploy, but...
[18:21:36] <urbanecm>	 https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210908T1800 doesn't list any patches for your name?
[18:21:38] <Jdlrobson>	 i used the wrong window again, didn't I...
[18:21:57] <Jdlrobson>	 a common problem with me
[18:22:42] <urbanecm>	 looks so
[18:22:47] <Jdlrobson>	 Moved it. [config] Italian Wikipedia is now a group 1 wiki {{gerrit|715571}}
[18:23:07] <wikibugs>	 (03PS6) 10Urbanecm: Italian Wikipedia is now a group 1 wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664) (owner: 10Jdlrobson)
[18:23:09] <urbanecm>	 let's do it then!
[18:23:15] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Italian Wikipedia is now a group 1 wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664) (owner: 10Jdlrobson)
[18:23:37] <Jdlrobson>	 sweet! 
[18:23:59] <wikibugs>	 (03Merged) 10jenkins-bot: Italian Wikipedia is now a group 1 wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664) (owner: 10Jdlrobson)
[18:24:06] <urbanecm>	 i won't ask you for any tests this time, as effect is only visible when train rides 🙂
[18:24:15] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 1002 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:24:53] <Jdlrobson>	 urbanecm: Roger that!
[18:26:13] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized dblists/: 6bcbe61f9a89086b775d84a81d55a7587cf26780: Italian Wikipedia is now a group 1 wiki (T286664; 1/2) (duration: 00m 58s)
[18:26:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:26:18] <stashbot>	 T286664: Expand the list of group 1 wikis to contain at least one (preferably 2) smaller "top ten size" wikis - https://phabricator.wikimedia.org/T286664
[18:26:28] <wikibugs>	 (03Abandoned) 10Ryan Kemper: wcqs: create wcqs microsite && move gui [puppet] - 10https://gerrit.wikimedia.org/r/657765 (https://phabricator.wikimedia.org/T271851) (owner: 10Ryan Kemper)
[18:27:28] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/config/itwiki.yaml: 6bcbe61f9a89086b775d84a81d55a7587cf26780: Italian Wikipedia is now a group 1 wiki (T286664; 2/2) (duration: 00m 58s)
[18:27:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:27:41] <urbanecm>	 Jdlrobson: should be live now!
[18:27:53] <Jdlrobson>	 urbanecm: sweet
[18:28:07] <Jdlrobson>	 urbanecm: I presume I should drop a note on the train task for next week. Is there anything else I should be doing?
[18:28:20] <urbanecm>	 Jdlrobson: yes, i was going to say that a note on the train task would be helpful
[18:28:26] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: When WMF staff requests to be added to ldap/wmf, also add their Phabricator account to #WMF-NDA - https://phabricator.wikimedia.org/T290605 (10Aklapper)
[18:28:41] <urbanecm>	 Jdlrobson: maybe a technews notice?
[18:29:05] <Jdlrobson>	 Yep that should have gone out already.
[18:29:28] <urbanecm>	 excellent. so i think only the train note would be enough :).
[18:31:18] <Jdlrobson>	 urbanecm: thanks for your help here!
[18:31:22] <Jdlrobson>	 I'm excited to have a new group 1 wiki
[18:31:26] <urbanecm>	 me too!
[18:31:47] <urbanecm>	 Jdlrobson: is the NearbyPages extension working fine at beta, btw? :-)
[18:33:14] <RhinosF1>	 Nice to see itwiki sorted
[18:40:38] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: When WMF staff requests to be added to ldap/wmf, also add their Phabricator account to #WMF-NDA - https://phabricator.wikimedia.org/T290605 (10Aklapper)
[18:42:48] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: When WMF staff requests to be added to ldap/wmf, also add their Phabricator account to #WMF-NDA - https://phabricator.wikimedia.org/T290605 (10Aklapper)
[18:46:21] <wikibugs>	 (03PS8) 10Ryan Kemper: elasticsearch: begin pull out config [software/spicerack] - 10https://gerrit.wikimedia.org/r/716532
[18:50:46] <wikibugs>	 10SRE, 10ops-codfw, 10Patch-For-Review: codfw: Ship back Raritan test PDU - https://phabricator.wikimedia.org/T287762 (10Papaul) Shipped out the CPI PDU today
[18:55:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: begin pull out config [software/spicerack] - 10https://gerrit.wikimedia.org/r/716532 (owner: 10Ryan Kemper)
[19:01:33] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+1] "LGTM - feel free to merge & run if the puppetised version is good to go." [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/719041 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto)
[19:19:09] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on db1150 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1135.21 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:19:57] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on db2139 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1186.65 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:20:23] <jynus>	 oh, that is the new s4 source backups, which doesn't have the alerts hidden
[19:20:30] <jynus>	 will fix that so it doesn't fire again
[19:21:17] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:(Need By: TBD) rack/setup (4) fundraising hosts - https://phabricator.wikimedia.org/T289812 (10Jgreen)
[19:25:12] <Krinkle>	 !log krinkle@mw1369 Running some benchmarks in Eqiad on load.php
[19:25:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:27:07] <Jdlrobson>	 urbanecm: yep NearbyPages has been working fine. It's just waiting on performance review now.
[19:27:24] <urbanecm>	 Jdlrobson: good luck with perf review then :)
[19:47:13] <wikibugs>	 (03CR) 10Jbond: "thanks cwhite for the chat over irc, will address comments tomorrow" [puppet] - 10https://gerrit.wikimedia.org/r/719368 (https://phabricator.wikimedia.org/T222826) (owner: 10Jbond)
[19:48:51] <wikibugs>	 (03CR) 10Jbond: puppetmaster: drop log messages from logstash reporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719368 (https://phabricator.wikimedia.org/T222826) (owner: 10Jbond)
[19:54:17] <wikibugs>	 10SRE, 10MW-on-K8s, 10serviceops, 10Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10Krinkle)
[19:58:51] <wikibugs>	 (03PS1) 10Krinkle: Fix label of rl_css url, improve other labels, add rl_startup url [software/benchmw] - 10https://gerrit.wikimedia.org/r/719608 (https://phabricator.wikimedia.org/T280497)
[20:00:04] <jouncebot>	 chrisalbon and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210908T2000).
[20:00:24] <wikibugs>	 (03PS2) 10Krinkle: Fix label of rl_css url, improve other labels, add rl_startup url [software/benchmw] - 10https://gerrit.wikimedia.org/r/719608 (https://phabricator.wikimedia.org/T280497)
[20:10:08] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware: Decommission asw-a-eqiad - https://phabricator.wikimedia.org/T218734 (10Jclark-ctr)
[20:10:46] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware: Decommission asw-a-eqiad - https://phabricator.wikimedia.org/T218734 (10Jclark-ctr) removed from racks and performed factory reset
[20:10:56] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware: Decommission asw-a-eqiad - https://phabricator.wikimedia.org/T218734 (10Jclark-ctr) 05Open→03Resolved
[20:23:56] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Netbox Errors in eqiad - https://phabricator.wikimedia.org/T290364 (10Jclark-ctr) Fixed Netbox errors
[20:24:05] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Netbox Errors in eqiad - https://phabricator.wikimedia.org/T290364 (10Jclark-ctr) 05Open→03Resolved
[20:27:06] <wikibugs>	 (03PS1) 10Dave Pifke: fpm-multiversion-base: add php-excimer extension [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/719609 (https://phabricator.wikimedia.org/T288165)
[20:27:29] <wikibugs>	 (03Abandoned) 10Dave Pifke: pipeline: include php-excimer extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719565 (https://phabricator.wikimedia.org/T288165) (owner: 10Dave Pifke)
[20:33:00] <Jdlrobson>	 Hey thcipriani: is https://gerrit.wikimedia.org/r/c/mediawiki/core/+/719500 blocking the train or is this a phpunit error?
[20:33:02] <Jdlrobson>	 I need some context
[20:33:48] <thcipriani>	 hey Jdlrobson sorry, should have noted this was beta:  https://en.wikipedia.beta.wmflabs.org/wiki/Special:Version
[20:33:56] <thcipriani>	 ^ revert is still undeployed
[20:37:07] <wikibugs>	 (03PS2) 10Legoktm: fpm-multiversion-base: add php-excimer extension [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/719609 (https://phabricator.wikimedia.org/T288165) (owner: 10Dave Pifke)
[20:38:12] <wikibugs>	 (03CR) 10Legoktm: [V: 03+2 C: 03+2] "PS2: Added a changelog entry" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/719609 (https://phabricator.wikimedia.org/T288165) (owner: 10Dave Pifke)
[20:39:24] <wikibugs>	 (03PS1) 10Dave Pifke: pipeline: add comment redirecting to correct file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719610
[20:40:03] <wikibugs>	 (03CR) 10Cwhite: puppetmaster: drop log messages from logstash reporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719368 (https://phabricator.wikimedia.org/T222826) (owner: 10Jbond)
[20:41:37] <legoktm>	 !log Successfully published image docker-registry.discovery.wmnet/php7.2-fpm-multiversion-base:1.0.2
[20:41:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:41:46] <dpifke>	 ^ Thanks!
[20:42:04] <legoktm>	 :) I think on the next pipeline run it should use the new image
[21:04:25] <wikibugs>	 (03PS1) 10Jdlrobson: Unset logo config rather than set to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619
[21:08:11] <icinga-wm>	 PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:09:23] <wikibugs>	 (03CR) 10Krinkle: [C: 04-1] Unset logo config rather than set to false (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 (owner: 10Jdlrobson)
[21:14:37] <wikibugs>	 (03CR) 10Jdlrobson: Unset logo config rather than set to false (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 (owner: 10Jdlrobson)
[21:16:11] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[21:17:14] <wikibugs>	 (03PS1) 10Ebernhardson: Revert "Revert "query_service: support multiple variants of wdqs microsite"" [puppet] - 10https://gerrit.wikimedia.org/r/719502
[21:21:40] <wikibugs>	 (03PS2) 10Ebernhardson: Revert "Revert "query_service: support multiple variants of wdqs microsite"" [puppet] - 10https://gerrit.wikimedia.org/r/719502
[21:22:30] <wikibugs>	 (03PS3) 10Ebernhardson: Revert "Revert "query_service: support multiple variants of wdqs microsite"" [puppet] - 10https://gerrit.wikimedia.org/r/719502 (https://phabricator.wikimedia.org/T280247)
[21:23:11] <wikibugs>	 (03CR) 10Krinkle: [C: 04-1] Unset logo config rather than set to false (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 (owner: 10Jdlrobson)
[21:37:31] <icinga-wm>	 PROBLEM - Check systemd state on ncredir3002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:43:01] <icinga-wm>	 RECOVERY - Check systemd state on ncredir3002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:45:59] <wikibugs>	 (03CR) 10Jdlrobson: "Thanks for the help here Timo. Just need to verify from CI this does the right thing." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 (owner: 10Jdlrobson)
[21:46:49] <wikibugs>	 (03PS4) 10Ryan Kemper: query_service: support multiple variants of wdqs microsite [puppet] - 10https://gerrit.wikimedia.org/r/719502 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson)
[21:48:27] <icinga-wm>	 PROBLEM - Check systemd state on ncredir3002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:51:02] <wikibugs>	 (03PS2) 10Jdlrobson: Unset logo config rather than set to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619
[21:51:29] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] query_service: support multiple variants of wdqs microsite [puppet] - 10https://gerrit.wikimedia.org/r/719502 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson)
[21:52:07] <icinga-wm>	 RECOVERY - Check systemd state on ncredir3002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:53:11] <ryankemper>	 !log [WDQS] T280247 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/719502 and ran puppet-agent on `miscweb*`
[21:53:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:53:16] <stashbot>	 T280247: Create WCQS UI microsite deployment - https://phabricator.wikimedia.org/T280247
[21:55:07] <ryankemper>	 !log [WDQS] T280247 Purged varnish to make sure change took effect: `echo 'https://query-preview.wikidata.org/' | mwscript purgeList.php` and `echo 'https://query.wikidata.org/' | mwscript purgeList.php` on `mwmaint1002`
[21:55:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:57:35] <icinga-wm>	 PROBLEM - Check systemd state on ncredir3002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:04:51] <icinga-wm>	 RECOVERY - Check systemd state on ncredir3002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:05:27] <wikibugs>	 (03CR) 10Jdlrobson: [C: 04-1] "Override doesn't appear to be working so will likely need a better solution here." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 (owner: 10Jdlrobson)
[22:08:51] <icinga-wm>	 RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:09:07] <wikibugs>	 (03PS3) 10Jdlrobson: Unset logo config rather than set to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619
[22:14:03] <icinga-wm>	 PROBLEM - Check systemd state on ganeti5001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:24:07] <ryankemper>	 !log [WDQS] T280247 Ran puppet-agent on `miscweb*` following merge of https://gerrit.wikimedia.org/r/c/wikidata/query/gui-deploy/+/714623
[22:24:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:24:12] <stashbot>	 T280247: Create WCQS UI microsite deployment - https://phabricator.wikimedia.org/T280247
[22:24:33] <icinga-wm>	 PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:25:02] <icinga-wm>	 RECOVERY - Check systemd state on ganeti5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:30:31] <icinga-wm>	 PROBLEM - Check systemd state on ganeti5001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:34:09] <icinga-wm>	 RECOVERY - Check systemd state on ganeti5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:34:32] <ryankemper>	 !log [WDQS] T280247 Ran puppet-agent on `miscweb*` following merge of https://gerrit.wikimedia.org/r/c/wikidata/query/gui-deploy/+/717649
[22:34:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:34:38] <stashbot>	 T280247: Create WCQS UI microsite deployment - https://phabricator.wikimedia.org/T280247
[22:38:19] <wikibugs>	 (03PS1) 10Ebernhardson: wcqs: Set admin groups and cluster to match wdqs [puppet] - 10https://gerrit.wikimedia.org/r/719643
[22:39:41] <icinga-wm>	 PROBLEM - Check systemd state on ganeti5001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:43:21] <icinga-wm>	 RECOVERY - Check systemd state on ganeti5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:57:35] <icinga-wm>	 PROBLEM - Check systemd state on ncredir3001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:00:05] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Evening backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210908T2300).
[23:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[23:03:05] <icinga-wm>	 RECOVERY - Check systemd state on ncredir3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:08:37] <icinga-wm>	 PROBLEM - Check systemd state on ncredir3001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:14:07] <icinga-wm>	 RECOVERY - Check systemd state on ncredir3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:19:37] <icinga-wm>	 PROBLEM - Check systemd state on ncredir3001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:23:19] <icinga-wm>	 RECOVERY - Check systemd state on ncredir3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:25:17] <icinga-wm>	 RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:38:42] <wikibugs>	 10SRE, 10DBA, 10observability, 10Datacenter-Switchover: Database read_only alert has a changing service description - https://phabricator.wikimedia.org/T290591 (10RLazarus)
[23:39:41] <wikibugs>	 (03CR) 10Legoktm: [C: 03+1] "LGTM, one suggestion inline (a follow-up would be fine)" [puppet] - 10https://gerrit.wikimedia.org/r/719538 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh)
[23:39:44] <wikibugs>	 10SRE, 10DBA, 10observability, 10Datacenter-Switchover: Database read_only alert has a changing service description - https://phabricator.wikimedia.org/T290591 (10RLazarus) T277174 seems related too.