[00:21:17] RECOVERY - Disk space on elastic2035 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic2035&var-datasource=codfw+prometheus/ops
[00:29:32] Puppet, SRE, Infrastructure-Foundations, Patch-For-Review, User-jbond: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 (jhathaway) @jbond & @colewhite I put together a modest alternative proposal, I would love to hear your thoughts. == Problem == YAML is not presc...
[00:50:57] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:55:29] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:16:17] PROBLEM - Disk space on elastic2035 is CRITICAL: DISK CRITICAL - free space: / 584 MB (2% inode=94%): /tmp 584 MB (2% inode=94%): /var/tmp 584 MB (2% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic2035&var-datasource=codfw+prometheus/ops
[03:03:11] RECOVERY - Disk space on elastic2035 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic2035&var-datasource=codfw+prometheus/ops
[04:53:22] SRE, Wikimedia-Mailing-lists: Mailing lists are not indexed by Google - https://phabricator.wikimedia.org/T299293 (Ladsgroup) I'm sure this is intentional and historical. It's in the robots of the TLD.
I will dig up the discussions for it (and it doesn't mean it cannot be revisited)
[05:53:48] (PS1) Marostegui: dbproxy1020: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/754123 (https://phabricator.wikimedia.org/T298586)
[05:54:32] (CR) Marostegui: [C: +2] dbproxy1020: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/754123 (https://phabricator.wikimedia.org/T298586) (owner: Marostegui)
[05:55:50] (PS1) Marostegui: dbproxy1016: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/754206 (https://phabricator.wikimedia.org/T298586)
[05:56:52] (CR) Marostegui: [C: +2] dbproxy1016: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/754206 (https://phabricator.wikimedia.org/T298586) (owner: Marostegui)
[05:57:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1016.eqiad.wmnet with OS bullseye
[05:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1016.eqiad.wmnet with OS bullseye
[06:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:30] marostegui: no alter tables on monday?? :D
[06:49:02] morning :)
[06:52:41] a bit later there will be!!
[06:57:50] \o/
[06:59:10] !log `systemctl reset-failed ifup@ens5.service` on an-test-client1001 and kafka-test1010
[06:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:12] RECOVERY - Check systemd state on kafka-test1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:06:58] (CR) Elukey: [V: +1] kafka: add check to test the Broker's TLS port (2 comments) [puppet] - https://gerrit.wikimedia.org/r/753738 (owner: Elukey)
[07:24:38] PROBLEM - Check systemd state on matomo1002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@matomo.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:44:03] (PS2) Amire80: WIP Remove kea, nod, and sms from wmfGetVariantSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/749889 (https://phabricator.wikimedia.org/T296286)
[07:44:38] (Abandoned) Amire80: Remove kea, nod, and sms from wmgExtraLanguageNames [mediawiki-config] - https://gerrit.wikimedia.org/r/754097 (https://phabricator.wikimedia.org/T299304) (owner: Amire80)
[07:44:48] (CR) Elukey: [C: +2] knative-serving: add params to configure requests/limits of queue-proxy [deployment-charts] - https://gerrit.wikimedia.org/r/753996 (https://phabricator.wikimedia.org/T296173) (owner: Elukey)
[07:45:18] (PS3) Amire80: Remove kea, nod, and sms from wmfGetVariantSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/749889 (https://phabricator.wikimedia.org/T296286)
[07:45:58] (PS4) Amire80: Remove kea, nod, and sms from wmfGetVariantSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/749889 (https://phabricator.wikimedia.org/T299304)
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220117T0800)
[08:06:41] (PS1) Muehlenhoff: Remove LDAP access for srodlund [puppet] - https://gerrit.wikimedia.org/r/754450
[08:07:24] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:10:11] (CR) Giuseppe Lavagetto: "I'm generally not a big fan of reformatting patches, because of how hard they make it to reconstruct git history. However, they're often a ne" [puppet] - https://gerrit.wikimedia.org/r/754114 (owner: JHathaway)
[08:11:07] (CR) Muehlenhoff: [C: +2] Remove LDAP access for srodlund [puppet] - https://gerrit.wikimedia.org/r/754450 (owner: Muehlenhoff)
[08:17:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM schema1004.eqiad.wmnet
[08:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:04] (PS1) Marostegui: Revert "dbproxy1016: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/754127
[08:19:07] (CR) Marostegui: [C: +2] Revert "dbproxy1016: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/754127 (owner: Marostegui)
[08:21:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM schema1004.eqiad.wmnet
[08:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[08:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[08:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316
(T285149)', diff saved to https://phabricator.wikimedia.org/P18746 and previous config saved to /var/cache/conftool/dbconfig/20220117-082638-marostegui.json
[08:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:42] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[08:27:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T285149)', diff saved to https://phabricator.wikimedia.org/P18747 and previous config saved to /var/cache/conftool/dbconfig/20220117-082746-marostegui.json
[08:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM schema1003.eqiad.wmnet
[08:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM schema1003.eqiad.wmnet
[08:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:57] (CR) Giuseppe Lavagetto: [V: +1 C: +2] envoy: make the choice of api version explicit [puppet] - https://gerrit.wikimedia.org/r/751717 (owner: Giuseppe Lavagetto)
[08:37:04] (PS4) Giuseppe Lavagetto: envoy: make the choice of api version explicit [puppet] - https://gerrit.wikimedia.org/r/751717
[08:42:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P18748 and previous config saved to /var/cache/conftool/dbconfig/20220117-084251-marostegui.json
[08:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:32] (PS6) Giuseppe Lavagetto: services_proxy::envoy: add support for v3 configuration [puppet] - https://gerrit.wikimedia.org/r/751718
[08:44:39] (CR) DCausse: [C: +2] rdf-streaming-updater: add support for WCQS [alerts] -
https://gerrit.wikimedia.org/r/753915 (owner: DCausse)
[08:45:21] SRE, Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (MoritzMuehlenhoff)
[08:46:54] (Merged) jenkins-bot: rdf-streaming-updater: add support for WCQS [alerts] - https://gerrit.wikimedia.org/r/753915 (owner: DCausse)
[08:49:53] (PS1) Marostegui: dbproxy1017: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/754452 (https://phabricator.wikimedia.org/T298586)
[08:50:23] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: Bking)
[08:52:26] (CR) Marostegui: [C: +2] dbproxy1017: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/754452 (https://phabricator.wikimedia.org/T298586) (owner: Marostegui)
[08:53:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1017.eqiad.wmnet with OS bullseye
[08:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P18749 and previous config saved to /var/cache/conftool/dbconfig/20220117-085756-marostegui.json
[08:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:55] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33267/console" [puppet] - https://gerrit.wikimedia.org/r/751718 (owner: Giuseppe Lavagetto)
[09:06:00] Could someone update the channel title and add me as the on-duty person? Thanks!
[09:07:47] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:08:45] (CR) Giuseppe Lavagetto: [V: +1 C: +2] services_proxy::envoy: add support for v3 configuration [puppet] - https://gerrit.wikimedia.org/r/751718 (owner: Giuseppe Lavagetto)
[09:08:59] thanks!
[09:13:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T285149)', diff saved to https://phabricator.wikimedia.org/P18750 and previous config saved to /var/cache/conftool/dbconfig/20220117-091300-marostegui.json
[09:13:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[09:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[09:13:05] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[09:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T285149)', diff saved to https://phabricator.wikimedia.org/P18751 and previous config saved to /var/cache/conftool/dbconfig/20220117-091308-marostegui.json
[09:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T285149)', diff saved to https://phabricator.wikimedia.org/P18752 and previous config saved to /var/cache/conftool/dbconfig/20220117-091316-marostegui.json
[09:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:42] (PS1) JMeybohm:
Add system users to kubernetes profile [labs/private] - https://gerrit.wikimedia.org/r/754453
[09:23:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1017.eqiad.wmnet with OS bullseye
[09:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P18753 and previous config saved to /var/cache/conftool/dbconfig/20220117-092820-marostegui.json
[09:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:35] (PS1) Kormat: wmfdb/log: Include logging statement location [software/wmfdb] - https://gerrit.wikimedia.org/r/754457
[09:30:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2143.codfw.wmnet with OS bullseye
[09:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2144.codfw.wmnet with OS bullseye
[09:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:19] (CR) Kormat: [C: +2] wmfdb/log: Include logging statement location [software/wmfdb] - https://gerrit.wikimedia.org/r/754457 (owner: Kormat)
[09:35:52] (CR) Elukey: [C: +1] Add system users to kubernetes profile [labs/private] - https://gerrit.wikimedia.org/r/754453 (owner: JMeybohm)
[09:36:59] (CR) Arturo Borrero Gonzalez: [C: +2] wmcs: toolforge: grid: improve get_node_info() error reporting [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753935 (owner: Arturo Borrero Gonzalez)
[09:37:07] (CR) Arturo Borrero Gonzalez: [C: +2] wmcs: toolforge: grid: add cookbook to query grid node information [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753936 (owner: Arturo Borrero Gonzalez)
[09:37:22] (CR) Arturo Borrero Gonzalez: [C: +2]
wmcs: toolforge: grid: depool_remove: fix internal hostname usage [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753921 (owner: Arturo Borrero Gonzalez)
[09:37:28] (Merged) jenkins-bot: wmfdb/log: Include logging statement location [software/wmfdb] - https://gerrit.wikimedia.org/r/754457 (owner: Kormat)
[09:37:59] (CR) Elukey: [C: +1] "LGTM! (modulo pcc run after merging the labs private change)" [puppet] - https://gerrit.wikimedia.org/r/753998 (https://phabricator.wikimedia.org/T228967) (owner: JMeybohm)
[09:39:37] (CR) Elukey: Migrate kube-scheduler away from insecure API (1 comment) [puppet] - https://gerrit.wikimedia.org/r/754003 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[09:40:17] (Merged) jenkins-bot: wmcs: toolforge: grid: add cookbook to query grid node information [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753936 (owner: Arturo Borrero Gonzalez)
[09:40:29] (CR) Filippo Giunchedi: "LGTM, see inline" [puppet] - https://gerrit.wikimedia.org/r/754029 (https://phabricator.wikimedia.org/T298994) (owner: Herron)
[09:40:52] (Merged) jenkins-bot: wmcs: toolforge: grid: depool_remove: fix internal hostname usage [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753921 (owner: Arturo Borrero Gonzalez)
[09:41:34] (CR) Filippo Giunchedi: [C: +1] logstash: ensure dlq directory exists [puppet] - https://gerrit.wikimedia.org/r/753571 (owner: Cwhite)
[09:41:56] (PS1) Giuseppe Lavagetto: envoy: start conversion to v3 configuration [puppet] - https://gerrit.wikimedia.org/r/754459
[09:41:58] (PS1) Giuseppe Lavagetto: envoy: switch production to v3 configuration api [puppet] - https://gerrit.wikimedia.org/r/754460
[09:42:37] (CR) Filippo Giunchedi: [C: +1] mcrouter::monitoring: remove module [puppet] - https://gerrit.wikimedia.org/r/751136 (https://phabricator.wikimedia.org/T272559) (owner: David Caro)
[09:43:06] (CR) Filippo Giunchedi: [C: +1]
assign role::apifeatureusage::logstash to apifeatureusage[12]001 hosts [puppet] - https://gerrit.wikimedia.org/r/752211 (https://phabricator.wikimedia.org/T297239) (owner: Herron)
[09:43:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P18754 and previous config saved to /var/cache/conftool/dbconfig/20220117-094325-marostegui.json
[09:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:39] (CR) JMeybohm: [V: +1] Migrate kube-scheduler away from insecure API (1 comment) [puppet] - https://gerrit.wikimedia.org/r/754003 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[09:49:20] (CR) Elukey: [C: +1] "LGTM (modulo pcc etc.. as the other one :)" [puppet] - https://gerrit.wikimedia.org/r/754003 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[09:51:08] (CR) JMeybohm: [C: +2] Make use-service-account-credentials the default for controller-manager [puppet] - https://gerrit.wikimedia.org/r/753998 (https://phabricator.wikimedia.org/T228967) (owner: JMeybohm)
[09:51:14] (CR) JMeybohm: [V: +1 C: +2] Migrate kube-scheduler away from insecure API [puppet] - https://gerrit.wikimedia.org/r/754003 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[09:51:36] (CR) JMeybohm: [V: +2 C: +2] Add system users to kubernetes profile [labs/private] - https://gerrit.wikimedia.org/r/754453 (owner: JMeybohm)
[09:55:48] (PS1) Filippo Giunchedi: Revert "varnish: temp ban Python-urllib/3.8" [puppet] - https://gerrit.wikimedia.org/r/754128
[09:58:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T285149)', diff saved to https://phabricator.wikimedia.org/P18755 and previous config saved to /var/cache/conftool/dbconfig/20220117-095830-marostegui.json
[09:58:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[09:58:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[09:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:35] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[09:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:38] (PS2) 4nn1l2: fawiki: Add flow-delete right to eliminators [mediawiki-config] - https://gerrit.wikimedia.org/r/753969 (https://phabricator.wikimedia.org/T299223)
[09:58:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T285149)', diff saved to https://phabricator.wikimedia.org/P18756 and previous config saved to /var/cache/conftool/dbconfig/20220117-095837-marostegui.json
[09:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T285149)', diff saved to https://phabricator.wikimedia.org/P18757 and previous config saved to /var/cache/conftool/dbconfig/20220117-095945-marostegui.json
[09:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:56] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:02:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2143.codfw.wmnet with OS bullseye
[10:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:28] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd1004.eqiad.wmnet with reason: switch to drbd storage
[10:03:30] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd1004.eqiad.wmnet with reason: switch to drbd storage
[10:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:04] PROBLEM - SSH on restbase2010.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:04:15] !log switching kubetcd1004 to DRBD-backed storage (required for ganeti update)
[10:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2144.codfw.wmnet with OS bullseye
[10:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_webrequest_partitions.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:13:45] (PS1) JMeybohm: Add missing notify on kube-scheduler config change [puppet] - https://gerrit.wikimedia.org/r/754462 (https://phabricator.wikimedia.org/T290967)
[10:14:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P18758 and previous config saved to /var/cache/conftool/dbconfig/20220117-101450-marostegui.json
[10:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:04] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nick Ray - https://phabricator.wikimedia.org/T299186 (Jelto)
[10:15:12] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nick Ray - https://phabricator.wikimedia.org/T299186 (Jelto) p:Triage→Medium
[10:15:56] !log marostegui@cumin1001 START -
Cookbook sre.hosts.reimage for host db1153.eqiad.wmnet with OS bullseye
[10:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1152.eqiad.wmnet with OS bullseye
[10:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:06] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nick Ray - https://phabricator.wikimedia.org/T299186 (Jelto) @odimitrijevic or @Ottomata we also need approval from your side. `nray` wants to be added to group `analytics_privatedata_users`.
[10:23:12] (CR) Elukey: [C: +1] Add missing notify on kube-scheduler config change [puppet] - https://gerrit.wikimedia.org/r/754462 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[10:23:42] (CR) JMeybohm: [C: +2] Add missing notify on kube-scheduler config change [puppet] - https://gerrit.wikimedia.org/r/754462 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[10:29:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P18759 and previous config saved to /var/cache/conftool/dbconfig/20220117-102954-marostegui.json
[10:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:56] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply on staging
[10:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:59] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply on staging
[10:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:20] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd1005.eqiad.wmnet with reason: switch to drbd storage
[10:31:22] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd1005.eqiad.wmnet with reason: switch to drbd storage
[10:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:28] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: sync on staging
[10:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:10] !log switching kubetcd1005 to DRBD-backed storage (required for ganeti update)
[10:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:43] (PS1) Vgutierrez: envoyproxy: Fix non-SNI v3 setup [puppet] - https://gerrit.wikimedia.org/r/754467
[10:42:27] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33270/console" [puppet] - https://gerrit.wikimedia.org/r/754467 (owner: Vgutierrez)
[10:42:33] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1001.eqiad.wmnet
[10:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1153.eqiad.wmnet with OS bullseye
[10:42:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:53] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1001.eqiad.wmnet
[10:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T285149)', diff saved to https://phabricator.wikimedia.org/P18760 and previous config saved to /var/cache/conftool/dbconfig/20220117-104459-marostegui.json
[10:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:03]
T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[10:45:17] (PS8) Jbond: P:installserver::proxy: Add domain whitelist to proxy [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T298087)
[10:45:26] (CR) Jbond: P:installserver::proxy: Add domain whitelist to proxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T298087) (owner: Jbond)
[10:45:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1152.eqiad.wmnet with OS bullseye
[10:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:06] RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:47:40] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1002.eqiad.wmnet
[10:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove recentchangeslinked group from s3 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P18761 and previous config saved to /var/cache/conftool/dbconfig/20220117-104801-marostegui.json
[10:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:06] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127
[10:50:20] (PS2) Jbond: format yaml with vinyl [puppet] - https://gerrit.wikimedia.org/r/754114 (owner: JHathaway)
[10:56:44] (CR) Vgutierrez: [C: +1] "I'd remove it now and consider permanently banning python-urllib as it doesn't comply with our UA policy" [puppet] - https://gerrit.wikimedia.org/r/754128 (owner: Filippo Giunchedi)
[10:56:50] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1002.eqiad.wmnet
[10:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:12] (CR) Filippo Giunchedi: [C: +2] Revert "varnish: temp ban Python-urllib/3.8" [puppet] - https://gerrit.wikimedia.org/r/754128 (owner: Filippo Giunchedi)
[10:59:19] (CR) Elukey: "John the change LGTM, but I have a general question about the allowed domains. IIUC the list will limit the scope of what the analytics VL" [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T298087) (owner: Jbond)
[11:00:04] !log systemctl reset-failed ifup@ens5.service on kubetcd1005 T273026
[11:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:07] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026
[11:03:24] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd1006.eqiad.wmnet with reason: switch to drbd storage
[11:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd1006.eqiad.wmnet with reason: switch to drbd storage
[11:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:30] RECOVERY - SSH on restbase2010.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:04:38] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:05:28] PROBLEM - Check systemd state on ml-serve-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens6.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:06:12] RECOVERY - restbase endpoints
health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:07:12] RECOVERY - Check systemd state on ml-serve-ctrl1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:08:13] !log switching kubetcd1006 to DRBD-backed storage (required for ganeti update)
[11:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:03] No deployments today or just a mistake?
[11:12:22] (PS1) Vgutierrez: cache::envoy: Limit number of requests per varnish conn [puppet] - https://gerrit.wikimedia.org/r/754471 (https://phabricator.wikimedia.org/T271421)
[11:13:01] nn1l2: US holiday (https://en.wikipedia.org/wiki/Martin_Luther_King_Jr._Day) :/
[11:13:13] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33272/console" [puppet] - https://gerrit.wikimedia.org/r/754471 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[11:13:16] (so intentional)
[11:13:54] (PS1) Kosta Harlan: Post-edit dialog: Reload page upon dialog closing for structured tasks [extensions/GrowthExperiments] (wmf/1.38.0-wmf.17) - https://gerrit.wikimedia.org/r/754129 (https://phabricator.wikimedia.org/T299188)
[11:15:09] Thanks, taavi.
[11:15:10] 10SRE, 10SRE-Access-Requests: NRodriguez uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T299336 (10Jelto) [11:16:46] (03PS1) 10Jelto: admin: revoke natalia-rodriguez key [puppet] - 10https://gerrit.wikimedia.org/r/754472 (https://phabricator.wikimedia.org/T299336) [11:19:17] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache::envoy: Limit number of requests per varnish conn [puppet] - 10https://gerrit.wikimedia.org/r/754471 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [11:19:46] (03PS2) 10Kosta Harlan: GrowthExperiments: Start add image experiment for desktop users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752657 (https://phabricator.wikimedia.org/T298122) [11:22:21] 10SRE, 10Infrastructure-Foundations: Write a cookbook to align the "master-capable" state of Ganeti nodes - https://phabricator.wikimedia.org/T299034 (10Volans) > With our setup, the VIP used for RAPI access by cookbooks restricts master-capable hosts to the row/VLAN for which the VIP has been created Is this... 
[11:23:07] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10MoritzMuehlenhoff) [11:23:32] (03PS1) 10Arturo Borrero Gonzalez: wmcs: vps: factorize common arguments [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/754473 [11:24:20] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) [11:26:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM kafkamon1002.eqiad.wmnet [11:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:41] (03PS3) 10Jbond: Hieradata: format yaml with vinyl [puppet] - 10https://gerrit.wikimedia.org/r/754114 (owner: 10JHathaway) [11:26:59] (03PS4) 10Jbond: Hieradata: format yaml with vinyl [puppet] - 10https://gerrit.wikimedia.org/r/754114 (https://phabricator.wikimedia.org/T236954) (owner: 10JHathaway) [11:30:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kafkamon1002.eqiad.wmnet [11:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:09] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1002/33271/" [puppet] - 10https://gerrit.wikimedia.org/r/754114 (https://phabricator.wikimedia.org/T236954) (owner: 10JHathaway) [11:32:38] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 (10jbond) Thanks for the work on this looks really good, in relation to linting vs automatic formatting i agree with the the conclusion that automatic... 
[11:34:10] (03PS2) 10Giuseppe Lavagetto: envoyproxy: Fix non-SNI v3 setup [puppet] - 10https://gerrit.wikimedia.org/r/754467 (owner: 10Vgutierrez) [11:34:12] (03PS2) 10Giuseppe Lavagetto: envoy: start conversion to v3 configuration [puppet] - 10https://gerrit.wikimedia.org/r/754459 [11:34:14] (03PS2) 10Giuseppe Lavagetto: envoy: switch production to v3 configuration api [puppet] - 10https://gerrit.wikimedia.org/r/754460 [11:36:21] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33273/console" [puppet] - 10https://gerrit.wikimedia.org/r/754459 (owner: 10Giuseppe Lavagetto) [11:38:01] (03CR) 10Jbond: "I made a few edits to the commit message, see inline for comments" [puppet] - 10https://gerrit.wikimedia.org/r/754114 (https://phabricator.wikimedia.org/T236954) (owner: 10JHathaway) [11:40:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2142.codfw.wmnet with OS bullseye [11:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:25] (03CR) 10JMeybohm: [C: 03+1] admin: revoke natalia-rodriguez key [puppet] - 10https://gerrit.wikimedia.org/r/754472 (https://phabricator.wikimedia.org/T299336) (owner: 10Jelto) [11:45:37] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33274/console" [puppet] - 10https://gerrit.wikimedia.org/r/754459 (owner: 10Giuseppe Lavagetto) [11:47:57] (03CR) 10Jbond: "thanks luca, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [11:48:00] (03CR) 10Jelto: [C: 03+2] admin: revoke natalia-rodriguez key [puppet] - 10https://gerrit.wikimedia.org/r/754472 (https://phabricator.wikimedia.org/T299336) (owner: 10Jelto) [11:49:54] PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:51:06] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: NRodriguez uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T299336 (10Jelto) @NRodriguez I've merged the revocation of the SSH key used on the production cluster. This was due to it also being used in WMCS, and thus comprom... [11:54:42] (03CR) 10Giuseppe Lavagetto: [C: 03+2] envoyproxy: Fix non-SNI v3 setup [puppet] - 10https://gerrit.wikimedia.org/r/754467 (owner: 10Vgutierrez) [11:59:20] 10SRE, 10Infrastructure-Foundations: Write a cookbook to align the "master-capable" state of Ganeti nodes - https://phabricator.wikimedia.org/T299034 (10MoritzMuehlenhoff) >>! In T299034#7625593, @Volans wrote: > Is this the only reason we can't pick any host as master? AFAICS yes. As part of the codfw Ganeti... [12:00:22] 10SRE, 10Infrastructure-Foundations: Write a cookbook to align the "master-capable" state of Ganeti nodes - https://phabricator.wikimedia.org/T299034 (10Volans) >>! In T299034#7625686, @MoritzMuehlenhoff wrote: >> If so maybe we could investigate a better way to expose the RAPI access that doesn't have this li... [12:00:41] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] "Seems ok from a cursory look. 
I'll run puppet first on one host per class, check everything looks ok, by looking at envoy error logs and l" [puppet] - 10https://gerrit.wikimedia.org/r/754459 (owner: 10Giuseppe Lavagetto) [12:04:24] (03CR) 10Ayounsi: [C: 03+1] remove references to centrallog2001 [homer/public] - 10https://gerrit.wikimedia.org/r/754028 (https://phabricator.wikimedia.org/T298994) (owner: 10Herron) [12:10:38] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer] - 10https://gerrit.wikimedia.org/r/751391 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [12:11:28] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/753684 (owner: 10Ayounsi) [12:13:02] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/753699 (owner: 10Ayounsi) [12:14:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2142.codfw.wmnet with OS bullseye [12:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:39] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) [12:19:04] (03CR) 10Volans: "Comments inline" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/753731 (owner: 10Ayounsi) [12:19:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1151.eqiad.wmnet with OS bullseye [12:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:08] 10SRE-Access-Requests: Requesting access to analytics clients for mfossati - https://phabricator.wikimedia.org/T299343 (10mfossati) [12:49:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1151.eqiad.wmnet with OS bullseye [12:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:10] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me, one questions/comment 
inline." [puppet] - 10https://gerrit.wikimedia.org/r/753016 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [13:20:44] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:22:50] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:26:41] (03PS1) 10Urbanecm: pwnwiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754504 (https://phabricator.wikimedia.org/T298115) [13:45:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove recentchanges group from s3 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P18762 and previous config saved to /var/cache/conftool/dbconfig/20220117-134520-marostegui.json [13:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:24] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 [13:51:05] 10SRE, 10SRE-Access-Requests: Requesting access to analytics clients for mfossati - https://phabricator.wikimedia.org/T299343 (10Jelto) 05Open→03In progress p:05Triage→03Medium [13:51:56] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:05:02] 10SRE, 10SRE-Access-Requests: Requesting access to analytics clients for mfossati - https://phabricator.wikimedia.org/T299343 (10Jelto) Hi, thanks for the request. It seems that your SSH key is already registered in Wikimedia Cloud Services. Could you please provide a unique SSH key for production access? We... 
[14:06:47] (03CR) 10Muehlenhoff: "I like the direction, two things we might need to complement it, just throwing out some ideas:" [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [14:13:35] (03PS1) 10Jbond: nfs-mounts: Used to store facts between all nodes [puppet] - 10https://gerrit.wikimedia.org/r/754509 [14:14:41] (03PS1) 10Marostegui: db2132: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/754511 (https://phabricator.wikimedia.org/T299344) [14:15:37] !log Reimage db2132 to Bullseye T299344 [14:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:41] T299344: Upgrade m1 to Bullseye - https://phabricator.wikimedia.org/T299344 [14:15:48] (03CR) 10Marostegui: [C: 03+2] db2132: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/754511 (https://phabricator.wikimedia.org/T299344) (owner: 10Marostegui) [14:16:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2132.codfw.wmnet with OS bullseye [14:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:32] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:27:03] 10SRE, 10SRE-Access-Requests: Requesting access to analytics clients for mfossati - https://phabricator.wikimedia.org/T299343 (10mfossati) [14:27:18] 10SRE, 10SRE-Access-Requests: Requesting access to analytics clients for mfossati - https://phabricator.wikimedia.org/T299343 (10mfossati) Hey @Jelto, Thank you for your feedback! I'm not using that key for Wikimedia Cloud Services (VPS and Toolforge, right?), but no problem, I've generated a fresh one as you... 
[14:30:32] !log btullis@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM an-airflow1001.eqiad.wmnet [14:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:58] (03PS2) 10Hnowlan: restbase: remove restbase2009 [puppet] - 10https://gerrit.wikimedia.org/r/753942 (https://phabricator.wikimedia.org/T295375) [14:36:43] (03CR) 10Hnowlan: [C: 03+2] restbase: remove restbase2009 [puppet] - 10https://gerrit.wikimedia.org/r/753942 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [14:37:03] !log removing restbase2009 from cassandra configs [14:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:47] !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM an-airflow1001.eqiad.wmnet [14:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:57] (03PS1) 10JMeybohm: k8s-apiserver: Disable insecure API on systems that no longer need it [puppet] - 10https://gerrit.wikimedia.org/r/754514 (https://phabricator.wikimedia.org/T290967) [14:39:59] (03PS1) 10JMeybohm: Make disabled insecure API the default on kubernetes masters [puppet] - 10https://gerrit.wikimedia.org/r/754515 (https://phabricator.wikimedia.org/T290967) [14:40:06] PROBLEM - Check systemd state on an-airflow1001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:52] !log systemctl reset-failed ifup@ens5.service on an-airflow1001 T273026 [14:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:55] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [14:42:36] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10BTullis) [14:42:46] RECOVERY - Check systemd state on an-airflow1001 is OK: 
OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:42:52] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:43:22] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33275/console" [puppet] - 10https://gerrit.wikimedia.org/r/754514 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [14:44:59] !log imported cassandra 3.11.11 to component/cassandradev for stretch-wikimedia and buster-wikimedia T298805 [14:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:03] T298805: Import Debian package of Cassandra 3.11.11 as 'dev' version - https://phabricator.wikimedia.org/T298805 [14:46:44] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [14:48:16] !log btullis@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM an-airflow1002.eqiad.wmnet [14:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:24] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:49:42] (03PS1) 10Marostegui: Revert "db2132: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/754134 [14:50:25] !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM an-airflow1002.eqiad.wmnet [14:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2132.codfw.wmnet with OS bullseye [14:50:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:01] 10SRE, 10Data-Engineering, 10Generated Data Platform, 10Platform Engineering: Import Debian package of Cassandra 3.11.11 as 'dev' version - https://phabricator.wikimedia.org/T298805 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff I've imported 3.11 for buster and stretch, enjoy :-) [14:51:03] (03CR) 10Marostegui: [C: 03+2] Revert "db2132: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/754134 (owner: 10Marostegui) [14:55:30] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10BTullis) [14:58:54] !log btullis@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM an-airflow1003.eqiad.wmnet [14:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:17] (03PS1) 10Jbond: P:installserver::proxy: switch access logs to syslog [puppet] - 10https://gerrit.wikimedia.org/r/754520 (https://phabricator.wikimedia.org/T298087) [14:59:19] (03PS1) 10Jbond: P:rsyslog: add squid to the list of programs sent to central log [puppet] - 10https://gerrit.wikimedia.org/r/754521 (https://phabricator.wikimedia.org/T298087) [14:59:51] (03CR) 10Elukey: [C: 03+1] k8s-apiserver: Disable insecure API on systems that no longer need it [puppet] - 10https://gerrit.wikimedia.org/r/754514 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [14:59:53] (03CR) 10jerkins-bot: [V: 04-1] P:installserver::proxy: switch access logs to syslog [puppet] - 10https://gerrit.wikimedia.org/r/754520 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [15:00:17] (03CR) 10jerkins-bot: [V: 04-1] P:rsyslog: add squid to the list of programs sent to central log [puppet] - 10https://gerrit.wikimedia.org/r/754521 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [15:00:56] (03CR) 10Elukey: [C: 03+1] "LGTM (modulo pcc run)" [puppet] - 
10https://gerrit.wikimedia.org/r/754515 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [15:01:02] !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM an-airflow1003.eqiad.wmnet [15:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:32] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10BTullis) [15:04:27] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] k8s-apiserver: Disable insecure API on systems that no longer need it [puppet] - 10https://gerrit.wikimedia.org/r/754514 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [15:08:08] (03PS2) 10JMeybohm: Make disabled insecure API the default on kubernetes masters [puppet] - 10https://gerrit.wikimedia.org/r/754515 (https://phabricator.wikimedia.org/T290967) [15:11:11] (03PS13) 10D3r1ck01: Define a contact form for Chapter/Thorg application status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/748120 (https://phabricator.wikimedia.org/T298024) [15:12:11] PROBLEM - SSH on mw2254.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:16:53] (03PS1) 10DCausse: blazegraph: prometheus exporter may bypass nginx [puppet] - 10https://gerrit.wikimedia.org/r/754523 [15:18:55] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:19:16] (03PS2) 10DCausse: blazegraph: prometheus exporter may bypass nginx [puppet] - 10https://gerrit.wikimedia.org/r/754523 [15:19:21] (03CR) 10jerkins-bot: [V: 04-1] blazegraph: prometheus exporter may bypass nginx [puppet] - 10https://gerrit.wikimedia.org/r/754523 (owner: 10DCausse) [15:21:01] RECOVERY - 
restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:22:10] (03PS3) 10DCausse: blazegraph: prometheus exporter may bypass nginx [puppet] - 10https://gerrit.wikimedia.org/r/754523 [15:24:38] (03CR) 10jerkins-bot: [V: 04-1] blazegraph: prometheus exporter may bypass nginx [puppet] - 10https://gerrit.wikimedia.org/r/754523 (owner: 10DCausse) [15:29:11] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:34:42] (03PS1) 10Jelto: admin: add slopes to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/754529 (https://phabricator.wikimedia.org/T299353) [15:34:57] !log mw2271, mw2272, mw2251, mw2252 (appserver and API canaries codfw) - apt-get remove --purge fonts*; apt-get remove --purge xfonts* (T294378) [15:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:01] T294378: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378 [15:35:16] 10SRE, 10SRE-Access-Requests: Requesting access to analytics clients for mfossati - https://phabricator.wikimedia.org/T299343 (10mfossati) [15:35:35] (03CR) 10jerkins-bot: [V: 04-1] admin: add slopes to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/754529 (https://phabricator.wikimedia.org/T299353) (owner: 10Jelto) [15:35:37] (03CR) 10Giuseppe Lavagetto: [C: 03+2] wmflib: add service::get_services_for function [puppet] - 10https://gerrit.wikimedia.org/r/746801 (owner: 10Giuseppe Lavagetto) [15:37:38] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for Sérgio Lopes - https://phabricator.wikimedia.org/T299353 (10Jelto) 05Open→03In progress p:05Triage→03Medium Hey, thanks for the request. I prepared a change to add you to LDAP. 
I will reach out to you here as soon as you... [15:38:05] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:38:32] (03CR) 10Filippo Giunchedi: "Backfilling from IRC discussion, the simplest approach for now is keep using graphite-labs" [puppet] - 10https://gerrit.wikimedia.org/r/751477 (https://phabricator.wikimedia.org/T241285) (owner: 10Majavah) [15:38:37] (03CR) 10Filippo Giunchedi: "Backfilling from IRC discussion, the simplest approach for now is keep using graphite-labs" [puppet] - 10https://gerrit.wikimedia.org/r/751681 (https://phabricator.wikimedia.org/T241285) (owner: 10Majavah) [15:40:02] 10SRE, 10SRE-Access-Requests: Requesting access to analytics clients for mfossati - https://phabricator.wikimedia.org/T299343 (10mfossati) I forgot to include the need for a [[https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos#Create_a_principal_for_a_real_user|Kerberos principal]]. I've updated th... 
[15:40:25] RECOVERY - Check no envoy runtime configuration is left persistent on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:40:29] !log mw2278, mw2279, mw2374, mw2376 (API and jobrunner canaries codfw) - apt-get remove --purge fonts*; apt-get remove --purge xfonts* (T294378) [15:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:33] T294378: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378 [15:41:14] <_joe_> puppet failing on the authdns servers is my fault, looking [15:43:10] (03PS1) 10Vgutierrez: envoy: Allow disabling circuit breakers [puppet] - 10https://gerrit.wikimedia.org/r/754532 (https://phabricator.wikimedia.org/T271421) [15:43:35] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:44:05] RECOVERY - Check no envoy runtime configuration is left persistent on mwdebug1001 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:45:18] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33276/console" [puppet] - 10https://gerrit.wikimedia.org/r/754532 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [15:46:28] !log parse2001, parse2002, wtp1025, wtp1026 (all parsoid canaries - apt-get remove --purge fonts*; apt-get remove --purge xfonts* (T294378) [15:46:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:32] T294378: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378 [15:47:49] (03CR) 10Filippo Giunchedi: 
"These probes add to the existing checks, i.e. will run in parallel" [puppet] - 10https://gerrit.wikimedia.org/r/747836 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [15:48:17] (03PS1) 10Giuseppe Lavagetto: wmflib: fix get_services_for [puppet] - 10https://gerrit.wikimedia.org/r/754535 [15:48:27] (03PS2) 10Jelto: admin: add slopes to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/754529 (https://phabricator.wikimedia.org/T299353) [15:49:53] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33277/console" [puppet] - 10https://gerrit.wikimedia.org/r/754535 (owner: 10Giuseppe Lavagetto) [15:53:30] (03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM, but quick question regarding http vs https. Is the software or our puppetization able to understand that it should use HTTPS for man" [puppet] - 10https://gerrit.wikimedia.org/r/747805 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [15:53:40] (03PS4) 10DCausse: blazegraph: prometheus exporter may bypass nginx [puppet] - 10https://gerrit.wikimedia.org/r/754523 [15:54:29] !log mw1414,mw1415,mw1416,mw1417,mw1418,mw1447,mw1448,mw1449,mw1450,mw1437,mw1438 (all canaries eqiad) - apt-get remove --purge fonts*; apt-get remove --purge xfonts* (T294378) [15:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:34] T294378: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378 [15:57:16] (03CR) 10Alexandros Kosiaris: [C: 03+1] hieradata: add zotero and helm-charts probes [puppet] - 10https://gerrit.wikimedia.org/r/747836 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [15:58:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] nfs-mounts: Used to store facts between all nodes [puppet] - 10https://gerrit.wikimedia.org/r/754509 (owner: 10Jbond) [15:59:37] (03CR) 10Filippo Giunchedi: 
hieradata: add more network probes for internal services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/747805 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [16:00:05] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33278/console" [puppet] - 10https://gerrit.wikimedia.org/r/747836 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [16:00:33] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/754523 (owner: 10DCausse) [16:01:00] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] wmflib: fix get_services_for [puppet] - 10https://gerrit.wikimedia.org/r/754535 (owner: 10Giuseppe Lavagetto) [16:01:59] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 278 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:02:10] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: add zotero and helm-charts probes [puppet] - 10https://gerrit.wikimedia.org/r/747836 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [16:02:17] (03PS4) 10Filippo Giunchedi: hieradata: add zotero and helm-charts probes [puppet] - 10https://gerrit.wikimedia.org/r/747836 (https://phabricator.wikimedia.org/T291946) [16:02:51] !log installing curl bugfix updates from bullseye 11.2 point release [16:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:19] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:07:44] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: add more network probes for internal services [puppet] - 10https://gerrit.wikimedia.org/r/747805 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [16:07:50] (03PS3) 10Filippo Giunchedi: hieradata: add more network probes for internal services [puppet] - 10https://gerrit.wikimedia.org/r/747805 (https://phabricator.wikimedia.org/T291946) [16:08:02] (03PS1) 10Muehlenhoff: Add library hint for freeipmi [puppet] - 10https://gerrit.wikimedia.org/r/754539 [16:10:44] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for freeipmi [puppet] - 10https://gerrit.wikimedia.org/r/754539 (owner: 10Muehlenhoff) [16:11:58] 10SRE, 10SRE Observability, 10User-ema, 10User-fgiunchedi: rsyslog service should fail on configuration errors - https://phabricator.wikimedia.org/T290870 (10lmata) [16:12:13] 10SRE, 10SRE Observability: Grafana share button drops duplicate URL params - https://phabricator.wikimedia.org/T292606 (10lmata) [16:12:18] 10SRE, 10Analytics, 10Data-Engineering, 10Event-Platform, and 2 others: Integrate Event Platform and ECS logs - https://phabricator.wikimedia.org/T291645 (10lmata) [16:12:31] 10SRE, 10SRE Observability, 10Traffic: VarnishTrafficDrop alert false positives due to DCs depooled - https://phabricator.wikimedia.org/T291148 (10lmata) [16:12:55] 10SRE, 10SRE Observability: node_cpu_frequency_hertz metric no longer present in Bullseye - https://phabricator.wikimedia.org/T286768 (10lmata) [16:13:16] 10SRE, 10SRE Observability, 10Wikimedia-Logstash, 10observability, 10Software-Licensing: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (10lmata) [16:13:25] RECOVERY - SSH on mw2254.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:13:32] !log installing freeipmi bugfix updates from bullseye 11.2 point release [16:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:37] !log installing wget bugfix updates from bullseye 11.2 point release [16:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:36] (03PS1) 10JMeybohm: Revert "staging-codfw: Enable masquarade_all" [puppet] - 10https://gerrit.wikimedia.org/r/754140 [16:23:42] (03PS1) 10JMeybohm: Revert "staging-codfw: Advertise service cluster IPs" [deployment-charts] - 10https://gerrit.wikimedia.org/r/754141 [16:23:56] (03CR) 10jerkins-bot: [V: 04-1] Revert "staging-codfw: Enable masquarade_all" [puppet] - 10https://gerrit.wikimedia.org/r/754140 (owner: 10JMeybohm) [16:24:02] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.2 point update - https://phabricator.wikimedia.org/T298021 (10MoritzMuehlenhoff) [16:24:21] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 (10fgiunchedi) >>! In T236954#7625612, @jbond wrote: > Thanks for the work on this looks really good, in relation to linting vs automatic formatting i... [16:29:12] (03PS1) 10Muehlenhoff: Enable ganeti 2.16 in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/754540 (https://phabricator.wikimedia.org/T296721) [16:29:53] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 (10MoritzMuehlenhoff) [16:30:16] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) 05Open→03Resolved All VMs have been restarted, thanks to everyone who helped with this! 
[16:30:41] !log installing python-virtualenv bugfix updates from bullseye 11.2 point release [16:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:17] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.2 point update - https://phabricator.wikimedia.org/T298021 (10MoritzMuehlenhoff) [16:44:09] (03PS2) 10JMeybohm: Revert "staging-codfw: Enable masquarade_all" [puppet] - 10https://gerrit.wikimedia.org/r/754140 [16:44:17] 10SRE, 10SRE-Access-Requests: Requesting access to analytics clients for mfossati - https://phabricator.wikimedia.org/T299343 (10MarkTraceur) Yea, it is approved. Thanks, all! [16:44:39] (03CR) 10JMeybohm: [C: 03+2] Revert "staging-codfw: Advertise service cluster IPs" [deployment-charts] - 10https://gerrit.wikimedia.org/r/754141 (owner: 10JMeybohm) [16:44:54] (03CR) 10jerkins-bot: [V: 04-1] Revert "staging-codfw: Enable masquarade_all" [puppet] - 10https://gerrit.wikimedia.org/r/754140 (owner: 10JMeybohm) [16:47:10] (03PS3) 10JMeybohm: Revert "staging-codfw: Enable masquarade_all" [puppet] - 10https://gerrit.wikimedia.org/r/754140 [16:48:47] (03Merged) 10jenkins-bot: Revert "staging-codfw: Advertise service cluster IPs" [deployment-charts] - 10https://gerrit.wikimedia.org/r/754141 (owner: 10JMeybohm) [16:49:19] (03CR) 10JMeybohm: [C: 03+2] Revert "staging-codfw: Enable masquarade_all" [puppet] - 10https://gerrit.wikimedia.org/r/754140 (owner: 10JMeybohm) [16:53:02] (03PS1) 10Giuseppe Lavagetto: Remove seed_image [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/754544 [16:56:00] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Please also remove the referenced file resources if not used anywhere else." 
[puppet] - 10https://gerrit.wikimedia.org/r/751136 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [16:57:02] (03PS1) 10Vgutierrez: cache::envoy: Disable circuit breakers [puppet] - 10https://gerrit.wikimedia.org/r/754545 (https://phabricator.wikimedia.org/T271421) [16:58:55] (03PS2) 10Vgutierrez: cache::envoy: Disable circuit breakers [puppet] - 10https://gerrit.wikimedia.org/r/754545 (https://phabricator.wikimedia.org/T271421) [16:59:51] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33281/console" [puppet] - 10https://gerrit.wikimedia.org/r/754545 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [16:59:59] (03PS4) 10Giuseppe Lavagetto: Be strict on undefined variables such as seed_image [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/747060 (https://phabricator.wikimedia.org/T297619) (owner: 10Hashar) [17:03:06] (03CR) 10AOkoth: [C: 03+2] "Looks good to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/753680 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [17:07:08] (03CR) 10Giuseppe Lavagetto: [C: 03+1] envoy: Allow disabling circuit breakers [puppet] - 10https://gerrit.wikimedia.org/r/754532 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [17:07:29] (03CR) 10Giuseppe Lavagetto: [C: 03+1] cache::envoy: Disable circuit breakers [puppet] - 10https://gerrit.wikimedia.org/r/754545 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [17:08:18] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] envoy: Allow disabling circuit breakers [puppet] - 10https://gerrit.wikimedia.org/r/754532 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [17:08:52] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache::envoy: Disable circuit breakers [puppet] - 10https://gerrit.wikimedia.org/r/754545 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [17:13:18] (03PS3) 10Jcrespo: mediabackups: Backup s5 media files at codfw [puppet] - 10https://gerrit.wikimedia.org/r/754023 (https://phabricator.wikimedia.org/T262668) [17:13:43] (03PS4) 10Jcrespo: mediabackups: Backup s5 media files at codfw [puppet] - 10https://gerrit.wikimedia.org/r/754023 (https://phabricator.wikimedia.org/T262668) [17:14:13] 10SRE, 10Analytics, 10Data-Engineering, 10Event-Platform, 10Sustainability (Incident Followup): Pool eventgate-main in both datacenters (active/active) - https://phabricator.wikimedia.org/T296699 (10odimitrijevic) @Ottomata is this complete? [17:16:41] (03CR) 10Jcrespo: [C: 03+2] mediabackups: Backup s5 media files at codfw [puppet] - 10https://gerrit.wikimedia.org/r/754023 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [17:22:26] 10SRE-swift-storage, 10Data-Engineering, 10Data-Engineering-Kanban: Deploy research_poc Swift credidentials to Hadoop - https://phabricator.wikimedia.org/T296945 (10odimitrijevic) p:05Triage→03High @Ottomata is this complete? 
Should we add documentation on wikitech? [17:23:58] (03PS5) 10DCausse: blazegraph: prometheus exporter may bypass nginx [puppet] - 10https://gerrit.wikimedia.org/r/754523 [17:24:26] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/754523 (owner: 10DCausse) [17:47:03] (03PS3) 10JMeybohm: Make disabled insecure API the default on kubernetes masters [puppet] - 10https://gerrit.wikimedia.org/r/754515 (https://phabricator.wikimedia.org/T290967) [17:47:05] (03PS1) 10JMeybohm: Move multiple kubernetes keys to common with ::site variable [puppet] - 10https://gerrit.wikimedia.org/r/754551 [17:52:39] (03PS2) 10JMeybohm: Move multiple kubernetes keys to common with ::site variable [puppet] - 10https://gerrit.wikimedia.org/r/754551 [17:53:37] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33283/console" [puppet] - 10https://gerrit.wikimedia.org/r/754551 (owner: 10JMeybohm) [17:54:10] (03CR) 10JMeybohm: [V: 03+1] "Expected to be NOOP" [puppet] - 10https://gerrit.wikimedia.org/r/754551 (owner: 10JMeybohm) [18:10:01] (03PS2) 10Arturo Borrero Gonzalez: wmcs: factorize common arguments [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/754473 [18:10:03] (03PS1) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: introduce cookbook to repool a node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/754555 (https://phabricator.wikimedia.org/T298948) [18:12:43] (03PS1) 10JMeybohm: Update codfw kubernetes master to a full node [puppet] - 10https://gerrit.wikimedia.org/r/754556 (https://phabricator.wikimedia.org/T290967) [18:22:46] 10SRE, 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10hashar) @Papaul ack, I have send the announcement to ops-l and [[ https... 
[18:46:01] !log krinkle@deploy1002 Started deploy [integration/docroot@1621c26]: (no justification provided) [18:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:15] !log krinkle@deploy1002 Finished deploy [integration/docroot@1621c26]: (no justification provided) (duration: 01m 14s) [18:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:59] (03CR) 10Ayounsi: [C: 03+2] Bump Capirca to 2.0.4 [software/homer] - 10https://gerrit.wikimedia.org/r/751391 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [18:52:46] (03Merged) 10jenkins-bot: Bump Capirca to 2.0.4 [software/homer] - 10https://gerrit.wikimedia.org/r/751391 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [19:19:59] PROBLEM - PyBal BGP sessions are established on lvs6002 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=drmrs+prometheus/ops [19:27:25] PROBLEM - PyBal BGP sessions are established on lvs6003 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=drmrs+prometheus/ops [19:59:11] PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:59:25] (03PS1) 10Subramanya Sastry: Drop 'inline-media-caption' lint requests [extensions/Linter] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754144 (https://phabricator.wikimedia.org/T297443) [20:00:21] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:01:18] (03CR) 10Subramanya Sastry: "We can deploy this backport to stop accumulating inline-media-caption lints right away and not worry about the impacts of rollback of the " [extensions/Linter] (wmf/1.38.0-wmf.17) - 
10https://gerrit.wikimedia.org/r/754144 (https://phabricator.wikimedia.org/T297443) (owner: 10Subramanya Sastry) [20:04:01] (03CR) 10Ladsgroup: "Do you want to backport the disabling of it as well?" [extensions/Linter] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754144 (https://phabricator.wikimedia.org/T297443) (owner: 10Subramanya Sastry) [20:09:46] (03CR) 10Subramanya Sastry: Drop 'inline-media-caption' lint requests (031 comment) [extensions/Linter] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754144 (https://phabricator.wikimedia.org/T297443) (owner: 10Subramanya Sastry) [20:10:23] (03PS1) 10Subramanya Sastry: Disable "inline-media-caption" category [extensions/Linter] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754145 (https://phabricator.wikimedia.org/T297443) [20:11:35] (03CR) 10Subramanya Sastry: Drop 'inline-media-caption' lint requests (031 comment) [extensions/Linter] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754144 (https://phabricator.wikimedia.org/T297443) (owner: 10Subramanya Sastry) [20:13:35] (03PS2) 10Jcrespo: mediabackups: Backup s6 media files at codfw [puppet] - 10https://gerrit.wikimedia.org/r/754024 (https://phabricator.wikimedia.org/T262668) [20:13:37] (03PS2) 10Jcrespo: mediabackups: Backup s7 media files at codfw [puppet] - 10https://gerrit.wikimedia.org/r/754025 (https://phabricator.wikimedia.org/T262668) [20:13:39] (03PS2) 10Jcrespo: mediabackups: Backup s8 media files at codfw [puppet] - 10https://gerrit.wikimedia.org/r/754026 (https://phabricator.wikimedia.org/T262668) [20:13:57] PROBLEM - SSH on restbase2010.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:19:29] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:48:54] !log aqu@deploy1002 Started deploy 
[airflow-dags/analytics-test@27a4f7a]: (no justification provided) [20:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:57] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@27a4f7a]: (no justification provided) (duration: 00m 02s) [20:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:29] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:01:35] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:22:39] PROBLEM - SSH on ms-fe2008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:23:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [21:23:15] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:38:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [22:10:13] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:16:33] RECOVERY - SSH on restbase2010.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:41:11] 10SRE, 10Wikimedia-Mailing-lists: Mailing lists are not indexed by Google - https://phabricator.wikimedia.org/T299293 (10Legoktm) Pre-Mailman3 we blocked all archives using robots.txt, but we dropped it in the migration 
(https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/822d8e935ca63e3e6991b64... [23:06:49] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: wikidatardf-all-dumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:10:25] 10SRE, 10Observability-Alerting: Better abstractions for puppet & icinga/nagios/shinken - https://phabricator.wikimedia.org/T85624 (10lmata) [23:11:17] 10SRE, 10MediaWiki-Debug-Logger, 10Observability-Logging, 10Wikimedia-Logstash, and 2 others: MediaWiki logging & encryption - https://phabricator.wikimedia.org/T126989 (10lmata) [23:12:18] 10SRE, 10Elasticsearch, 10Observability-Logging, 10Wikimedia-Logstash, and 2 others: logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable - https://phabricator.wikimedia.org/T176335 (10lmata) [23:12:27] 10SRE, 10Analytics-Radar, 10Observability-Logging, 10Wikimedia-Logstash, and 2 others: Retire udp2log: onboard its producers and consumers to the logging pipeline - https://phabricator.wikimedia.org/T205856 (10lmata) [23:12:50] 10SRE, 10Analytics, 10Data-Engineering, 10Event-Platform, and 2 others: Integrate Event Platform and ECS logs - https://phabricator.wikimedia.org/T291645 (10lmata) [23:18:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [23:18:58] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability, 10Software-Licensing: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (10lmata) [23:19:37] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability: Buster elasticsearch-curator version not compatible with ELK7 - https://phabricator.wikimedia.org/T257024 (10lmata) [23:24:21] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Reduce 
the number of fields declared in elasticsearch by logstash - https://phabricator.wikimedia.org/T180051 (10lmata) [23:24:40] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability, 10User-herron: Ship MX logs to ELK - https://phabricator.wikimedia.org/T197173 (10lmata) [23:24:51] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability: Rationalize default logrotate "rotated" file extensions - https://phabricator.wikimedia.org/T207296 (10lmata) [23:25:05] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability, 10serviceops: ensure httpd error logs from "misc apps" (krypton) end up in logstash - https://phabricator.wikimedia.org/T216090 (10lmata) [23:25:17] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Move iegreview from udp2log to syslog - https://phabricator.wikimedia.org/T215497 (10lmata) [23:26:07] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: elk7: fields indexed without position data; cannot run PhraseQuery - https://phabricator.wikimedia.org/T248400 (10lmata) [23:26:33] 10SRE, 10Observability-Logging, 10Privacy Engineering, 10Wikimedia-Logstash, and 2 others: Production logstash should be protected by two-factor auth, at the least - https://phabricator.wikimedia.org/T237630 (10lmata) [23:27:01] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability, and 2 others: Implement sensitive logstash access control - https://phabricator.wikimedia.org/T213902 (10lmata) [23:27:11] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability: Investigate missing WikibaseQualityConstraints logs in logstash. 
- https://phabricator.wikimedia.org/T214031 (10lmata) [23:27:32] !log forced session revocation on phab for a user T299315 [23:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:41] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability: pybal logs into logstash - https://phabricator.wikimedia.org/T223924 (10lmata) [23:28:45] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability: Ingest production logs with ELK7 - https://phabricator.wikimedia.org/T235891 (10lmata) [23:29:00] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability, and 2 others: Standardize the logging format - https://phabricator.wikimedia.org/T234565 (10lmata) [23:38:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [23:40:27] 10SRE, 10Language-Team (Language-2022-January-March): Deploy Flores MT secrets in Production for ContentTranslation - https://phabricator.wikimedia.org/T299023 (10Dzahn) 05Open→03In progress a:03Dzahn [23:42:33] 10SRE, 10Language-Team (Language-2022-January-March): Deploy Flores MT secrets in Production for ContentTranslation - https://phabricator.wikimedia.org/T299023 (10Dzahn) @KartikMistry Ok, I found the files, could decrypt them and added them to the private repo. they are available as Flores:key and Flores:secr... 
[23:45:45] (03CR) 10Dzahn: "cxserver/staging.yaml , cxserver/eqiad.yaml , cxserver/codfw.yaml on deploy1002 now have:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/751547 (https://phabricator.wikimedia.org/T298584) (owner: 10KartikMistry) [23:46:07] 10SRE, 10Language-Team (Language-2022-January-March): Deploy Flores MT secrets in Production for ContentTranslation - https://phabricator.wikimedia.org/T299023 (10Dzahn) 05In progress→03Resolved [23:49:52] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability, 10serviceops: ensure httpd error logs from "misc apps" (krypton) end up in logstash - https://phabricator.wikimedia.org/T216090 (10Dzahn) A few things have changed since I created this ticket. scholarships, the app mentioned, has bee... [23:52:17] (03CR) 10Jcrespo: [C: 03+2] mediabackups: Backup s6 media files at codfw [puppet] - 10https://gerrit.wikimedia.org/r/754024 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo)