[00:05:55] PROBLEM - dump of es5 in codfw on backupmon1001 is CRITICAL: dump for es5 at codfw (es2025) taken more than a week ago: Most recent backup 2022-05-17 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:11:11] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:14:39] 10SRE, 10DBA, 10Performance-Team, 10Patch-For-Review, 10Sustainability (MediaWiki-MultiDC): Perform testing for TLS effect on connection rate - https://phabricator.wikimedia.org/T171071 (10Krinkle) [00:15:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance [00:15:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance [00:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:15:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T298560)', diff saved to https://phabricator.wikimedia.org/P28462 and previous config saved to /var/cache/conftool/dbconfig/20220525-001552-ladsgroup.json [00:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:01] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [00:16:57] 10SRE, 10DBA, 10Performance-Team (Radar), 10Sustainability (MediaWiki-MultiDC): App servers <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809 (10Krinkle) [00:21:25] PROBLEM - dump of es5 in eqiad on backupmon1001 is CRITICAL: dump for es5 at eqiad (es1025) taken more than a week ago: Most recent backup 2022-05-17 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup 
[00:22:19] (03PS1) 10AGueyte: Deploy SimilarEditors to the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799010 (https://phabricator.wikimedia.org/T306908) [00:22:47] (03Abandoned) 10AGueyte: Deploy SimilarEditors to the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799010 (https://phabricator.wikimedia.org/T306908) (owner: 10AGueyte) [00:24:03] RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [00:30:25] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [00:35:58] (03PS1) 10Cathal Mooney: Change order that Netbox server provision script gets old/new vlan name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/799011 (https://phabricator.wikimedia.org/T304936) [00:36:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [00:36:55] (03PS1) 10AGueyte: Deploy SimilarEditors to the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799012 (https://phabricator.wikimedia.org/T306908) [00:42:13] (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:47:34] (03PS1) 10Cathal Mooney: Remove DHCP option 82 insertion for all l3_switches [homer/public] - 10https://gerrit.wikimedia.org/r/799013 
(https://phabricator.wikimedia.org/T304936) [00:52:41] RECOVERY - dump of es5 in eqiad on backupmon1001 is OK: Last dump for es5 at eqiad (es1025) taken on 2022-05-24 00:00:01 (3052 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [01:02:58] (KubernetesRsyslogDown) firing: (8) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [01:06:03] (03CR) 10Cathal Mooney: [C: 03+2] Remove DHCP option 82 insertion for all l3_switches [homer/public] - 10https://gerrit.wikimedia.org/r/799013 (https://phabricator.wikimedia.org/T304936) (owner: 10Cathal Mooney) [01:06:46] (03Merged) 10jenkins-bot: Remove DHCP option 82 insertion for all l3_switches [homer/public] - 10https://gerrit.wikimedia.org/r/799013 (https://phabricator.wikimedia.org/T304936) (owner: 10Cathal Mooney) [01:08:31] RECOVERY - dump of es5 in codfw on backupmon1001 is OK: Last dump for es5 at codfw (es2025) taken on 2022-05-24 00:00:02 (3052 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [01:30:06] 10SRE, 10DBA, 10Performance-Team (Radar), 10Sustainability (MediaWiki-MultiDC): App servers <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809 (10tstarling) TLS connections from MediaWiki app servers to MariaDB appear to work just fine. You just pass flags=DBO_SSL a... 
[01:32:22] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:34:39] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:40:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:41:45] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [01:45:17] (03PS1) 10Dzahn: gitlab: switch backup location to /srv, don't use /etc [puppet] - 10https://gerrit.wikimedia.org/r/799016 (https://phabricator.wikimedia.org/T274463) [01:46:00] (03CR) 10CI reject: [V: 04-1] gitlab: switch backup location to /srv, don't use /etc [puppet] - 10https://gerrit.wikimedia.org/r/799016 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [01:55:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:02:47] !log restart elasticsearch_6@production-search-psi-eqiad to resolve CirrusSearchJVMGCOldPoolFlatlined alert [02:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:06:01] (CirrusSearchJVMGCOldPoolFlatlined) resolved: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - 
https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [02:12:58] (KubernetesRsyslogDown) firing: (7) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [02:35:42] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:06:46] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:14:36] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:23:06] PROBLEM - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is CRITICAL: CRITICAL - Certificate gerrit.wikimedia.org expires in 3 day(s) (Sat 28 May 2022 08:33:22 PM GMT +0000). https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [03:26:56] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:27:30] RECOVERY - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is OK: OK - Certificate gerrit.wikimedia.org will expire on Wed 27 Jul 2022 08:27:52 PM GMT +0000. 
https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [03:42:24] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:42:36] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:51:39] (03PS3) 10KartikMistry: Enable Content and Section Translation in Serbian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/797977 (https://phabricator.wikimedia.org/T304858) [03:56:34] (03PS4) 10KartikMistry: Enable Content and Section Translation in Serbian and Zulu Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/797977 (https://phabricator.wikimedia.org/T304834) [03:57:31] (03CR) 10CI reject: [V: 04-1] Enable Content and Section Translation in Serbian and Zulu Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/797977 (https://phabricator.wikimedia.org/T304834) (owner: 10KartikMistry) [04:01:43] (03PS5) 10KartikMistry: Enable Content and Section Translation in Serbian and Zulu Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/797977 (https://phabricator.wikimedia.org/T304834) [04:07:56] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:28:04] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:42:13] (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:56:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T298560)', diff saved to 
https://phabricator.wikimedia.org/P28463 and previous config saved to /var/cache/conftool/dbconfig/20220525-045612-ladsgroup.json [04:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:56:20] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [04:57:04] !log Rename revision_actor_temp on s4 T307906 [04:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:57:11] T307906: Drop revision_actor_temp in production - https://phabricator.wikimedia.org/T307906 [04:59:02] PROBLEM - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is CRITICAL: CRITICAL - Certificate gerrit.wikimedia.org expires in 3 day(s) (Sat 28 May 2022 08:33:22 PM GMT +0000). https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [05:00:16] RECOVERY - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is OK: OK - Certificate gerrit.wikimedia.org will expire on Wed 27 Jul 2022 08:27:52 PM GMT +0000. 
https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [05:03:12] !log Rename revision_actor_temp on s2 T307906 [05:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:03:18] T307906: Drop revision_actor_temp in production - https://phabricator.wikimedia.org/T307906 [05:05:39] !log marostegui@cumin2002 dbctl commit (dc=all): 'Depool db1143', diff saved to https://phabricator.wikimedia.org/P28464 and previous config saved to /var/cache/conftool/dbconfig/20220525-050538-marostegui.json [05:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:06:12] !log Install 10.4.25 on db1143 T308915 [05:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:06:17] T308915: Prepare 10.4.25 and 10.6.8 Mariadb packages - https://phabricator.wikimedia.org/T308915 [05:07:58] (KubernetesRsyslogDown) firing: (6) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:08:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 1%: After migrating to 10.4.25', diff saved to https://phabricator.wikimedia.org/P28465 and previous config saved to /var/cache/conftool/dbconfig/20220525-050833-root.json [05:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:05] (03PS1) 10Marostegui: control-mariadb-10.4-bullseye: Bump version [software] - 10https://gerrit.wikimedia.org/r/799019 (https://phabricator.wikimedia.org/T308915) [05:11:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P28466 and previous config saved to /var/cache/conftool/dbconfig/20220525-051117-ladsgroup.json [05:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:11:58] (03CR) 10Marostegui: [C: 03+2] 
control-mariadb-10.4-bullseye: Bump version [software] - 10https://gerrit.wikimedia.org/r/799019 (https://phabricator.wikimedia.org/T308915) (owner: 10Marostegui) [05:12:55] (03Merged) 10jenkins-bot: control-mariadb-10.4-bullseye: Bump version [software] - 10https://gerrit.wikimedia.org/r/799019 (https://phabricator.wikimedia.org/T308915) (owner: 10Marostegui) [05:14:49] !log Rename revision_actor_temp on s7 T307906 [05:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:54] T307906: Drop revision_actor_temp in production - https://phabricator.wikimedia.org/T307906 [05:16:50] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:18:41] !log Rename revision_actor_temp on s5 T307906 [05:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 5%: After migrating to 10.4.25', diff saved to https://phabricator.wikimedia.org/P28467 and previous config saved to /var/cache/conftool/dbconfig/20220525-052336-root.json [05:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:26:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P28468 and previous config saved to /var/cache/conftool/dbconfig/20220525-052622-ladsgroup.json [05:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:30:44] !log Rename revision_actor_temp on s1 T307906 [05:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:30:50] T307906: Drop revision_actor_temp in production - https://phabricator.wikimedia.org/T307906 [05:38:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 10%: After migrating to 10.4.25', diff saved to 
https://phabricator.wikimedia.org/P28469 and previous config saved to /var/cache/conftool/dbconfig/20220525-053840-root.json [05:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:41:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T298560)', diff saved to https://phabricator.wikimedia.org/P28470 and previous config saved to /var/cache/conftool/dbconfig/20220525-054127-ladsgroup.json [05:41:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance [05:41:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance [05:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:41:33] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [05:41:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T298560)', diff saved to https://phabricator.wikimedia.org/P28471 and previous config saved to /var/cache/conftool/dbconfig/20220525-054135-ladsgroup.json [05:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 25%: After migrating to 10.4.25', diff saved to https://phabricator.wikimedia.org/P28472 and previous config saved to /var/cache/conftool/dbconfig/20220525-055344-root.json [05:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:59] PROBLEM - Wikitech-static main page has content on labweb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static 
[06:03:52] Static seems down [06:07:25] RECOVERY - Wikitech-static main page has content on labweb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 26021 bytes in 0.298 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [06:08:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 50%: After migrating to 10.4.25', diff saved to https://phabricator.wikimedia.org/P28473 and previous config saved to /var/cache/conftool/dbconfig/20220525-060848-root.json [06:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:43] (03PS4) 10Abijeet Patro: Add namespaces to Punjabi wikisource default search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793799 (https://phabricator.wikimedia.org/T287887) [06:20:23] RECOVERY - Check systemd state on an-tool1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:20:38] !log `elukey@an-tool1011:~$ sudo systemctl reset-failed ifup@ens13.service` - T273026 [06:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:46] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [06:22:51] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:23:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 75%: After migrating to 10.4.25', diff saved to https://phabricator.wikimedia.org/P28474 and previous config saved to /var/cache/conftool/dbconfig/20220525-062352-root.json [06:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:04] (03CR) 10Elukey: "Hi folks! 
Sorry late to the party but adding two comments:" [puppet] - 10https://gerrit.wikimedia.org/r/792116 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [06:38:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 100%: After migrating to 10.4.25', diff saved to https://phabricator.wikimedia.org/P28475 and previous config saved to /var/cache/conftool/dbconfig/20220525-063856-root.json [06:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:44] (03CR) 10Slyngshede: [V: 03+1] "Handle the problem caused by 792116, where the systemd timer continuously fails, due to fair scheduler not actually being used currently." [puppet] - 10https://gerrit.wikimedia.org/r/799257 (owner: 10Slyngshede) [07:10:50] * kart_ deploying first patch.. [07:11:32] logmsgbot seems not back.. :/ [07:11:51] !log Config: [[gerrit:797977|Enable Content and Section Translation in Serbian and Zulu Wikipedias (T304834 T304858)]] [07:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:03] T304834: Enable Content and Section Translation for Zulu Wikipedia - https://phabricator.wikimedia.org/T304834 [07:12:03] T304858: Enable Content and Section Translation for Serbian Wikipedia - https://phabricator.wikimedia.org/T304858 [07:13:28] OK. 2nd patch requires manual rebase. On it.. [07:15:49] (03PS2) 10KartikMistry: Enable Section Translation for Hindi in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/798389 (https://phabricator.wikimedia.org/T308834) [07:17:46] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM overall, see inline for comments. +1'd anyways as it is better to start backups soon and we can iterate" [puppet] - 10https://gerrit.wikimedia.org/r/798886 (https://phabricator.wikimedia.org/T237224) (owner: 10Cwhite) [07:18:01] (03CR) 10KartikMistry: [C: 03+2] "UTC morning backport deployment." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/798389 (https://phabricator.wikimedia.org/T308834) (owner: 10KartikMistry) [07:19:10] (03Merged) 10jenkins-bot: Enable Section Translation for Hindi in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/798389 (https://phabricator.wikimedia.org/T308834) (owner: 10KartikMistry) [07:19:43] (03CR) 10Elukey: Disable cleanup on unused Fairscheduler for Hadoop. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799257 (owner: 10Slyngshede) [07:20:58] Testing 2nd patch on mwdebug1001 [07:21:50] .. and deploying.. [07:23:10] !log Config: [[gerrit:798389|Enable Section Translation for Hindi in testwiki (T308834)]] [07:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:16] T308834: Enable Section Translation on some wikis while Content Translation remains in beta - https://phabricator.wikimedia.org/T308834 [07:24:05] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:24:49] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks, merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/797362 (owner: 10Zabe) [07:29:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:09] (03PS3) 10Muehlenhoff: gitlab/gitlab_runner: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793399 (https://phabricator.wikimedia.org/T308013) [07:30:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:30:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [07:31:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [07:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:04] (03CR) 10Muehlenhoff: [C: 03+2] gitlab/gitlab_runner: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793399 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [07:36:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1133.eqiad.wmnet with reason: Maintenance [07:36:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1133.eqiad.wmnet with reason: Maintenance [07:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[07:36:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [07:36:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [07:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [07:41:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [07:41:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [07:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [07:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T303603)', diff saved to https://phabricator.wikimedia.org/P28476 and previous config saved to /var/cache/conftool/dbconfig/20220525-074205-ladsgroup.json [07:42:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:15] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [07:47:21] (03PS1) 10Ladsgroup: logging: Add index 
hint when asking for a specific user [core] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/798814 (https://phabricator.wikimedia.org/T303089) [07:48:44] (03PS1) 10Ladsgroup: logging: Add index hint when asking for a specific user [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/798815 (https://phabricator.wikimedia.org/T303089) [07:48:50] (03CR) 10Kosta Harlan: "recheck" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/799008 (https://phabricator.wikimedia.org/T309152) (owner: 10MewOphaswongse) [07:48:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T303603)', diff saved to https://phabricator.wikimedia.org/P28477 and previous config saved to /var/cache/conftool/dbconfig/20220525-074856-ladsgroup.json [07:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:05] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [07:49:25] am I too late for a backport? [07:49:54] kart_: are you finished? [07:51:01] (03PS1) 10Slyngshede: Run a public and private repo on a single host. [puppet] - 10https://gerrit.wikimedia.org/r/799264 [07:51:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2021.codfw.wmnet [07:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:33] ^ kubetcd2006 will go down temporarily [07:52:14] (03PS1) 10Ladsgroup: Set templatelinks migration to read new everywhere except three wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799265 (https://phabricator.wikimedia.org/T306673) [07:52:47] Amir1 / urbanecm are you ok with me deploying a patch now? [07:52:59] fine by me [07:53:33] how can I work around the (unrelated) test failures in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/799008 ? 
It is a known issue that from time to time the parser tests break with GrowthExperiments (and other extensions that have a dependency on parsoid in CI) [07:53:40] kostajh: yeah. Sorry. [07:54:17] PROBLEM - Host kubetcd2006 is DOWN: PING CRITICAL - Packet loss = 100% [07:54:48] (03CR) 10Muehlenhoff: Run a public and private repo on a single host. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799264 (owner: 10Slyngshede) [07:55:18] (03CR) 10Kosta Harlan: [C: 03+2] "backport" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/799008 (https://phabricator.wikimedia.org/T309152) (owner: 10MewOphaswongse) [07:55:29] RECOVERY - Host kubetcd2006 is UP: PING OK - Packet loss = 0%, RTA = 31.77 ms [07:57:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2021.codfw.wmnet [07:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:55] (03PS2) 10Slyngshede: WIP: Run a public and private repo on a single host. [puppet] - 10https://gerrit.wikimedia.org/r/799264 [07:59:18] waiting for jenkins... [07:59:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1173.eqiad.wmnet with reason: Maintenance [07:59:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1173.eqiad.wmnet with reason: Maintenance [08:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:05] dancy and jnuche: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220525T0800). [08:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:10] (03CR) 10Slyngshede: WIP: Run a public and private repo on a single host. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799264 (owner: 10Slyngshede) [08:01:33] (03CR) 10Jcrespo: "Note the currently configured bacula fileset:" [puppet] - 10https://gerrit.wikimedia.org/r/799016 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [08:01:53] jnuche: still working on a backport [08:02:02] (03CR) 10Kosta Harlan: [C: 03+2] "backport" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/799007 (https://phabricator.wikimedia.org/T309152) (owner: 10MewOphaswongse) [08:02:32] kostajh: 👍 [08:02:46] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:02:58] (KubernetesRsyslogDown) firing: (8) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:03:55] 10SRE, 10Wikimedia-Mailing-lists: Mailing list for Turkic Wikimedians - https://phabricator.wikimedia.org/T309155 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup I created it.
[08:04:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P28478 and previous config saved to /var/cache/conftool/dbconfig/20220525-080403-ladsgroup.json [08:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:33] (03CR) 10Slyngshede: [C: 03+2] P:aptrepo::private: update to use httpd listen_ports [puppet] - 10https://gerrit.wikimedia.org/r/798617 (owner: 10Jbond) [08:08:07] (03CR) 10Slyngshede: [C: 03+2] P:aptrepo::private: update to use httpd listen_ports (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/798617 (owner: 10Jbond) [08:08:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T298560)', diff saved to https://phabricator.wikimedia.org/P28479 and previous config saved to /var/cache/conftool/dbconfig/20220525-080829-ladsgroup.json [08:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:34] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [08:09:35] (03PS18) 10Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793873 [08:10:18] (03CR) 10CI reject: [V: 04-1] Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793873 (owner: 10Ladsgroup) [08:12:13] 10SRE, 10SRE-swift-storage, 10ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (10MatthewVernon) Thanks for the update, and thank you for persevering! 
[08:12:47] (03PS19) 10Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793873 [08:14:06] (03CR) 10CI reject: [V: 04-1] Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793873 (owner: 10Ladsgroup) [08:15:26] (03PS20) 10Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793873 [08:15:56] jouncebot: nowandnext [08:15:57] For the next 1 hour(s) and 44 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220525T0800) [08:15:57] In 4 hour(s) and 44 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220525T1300) [08:16:29] still waiting on jenkins [08:17:30] It seems we might need to deploy train blocker. [08:18:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host schema2003.codfw.wmnet [08:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:43] (03CR) 10CI reject: [V: 04-1] Add an image: Attach view image details button to .mw-ge-recommendedImage-imageWrapper [extensions/GrowthExperiments] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/799008 (https://phabricator.wikimedia.org/T309152) (owner: 10MewOphaswongse) [08:19:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1132 to install 10.6.8', diff saved to https://phabricator.wikimedia.org/P28480 and previous config saved to /var/cache/conftool/dbconfig/20220525-081901-marostegui.json [08:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P28481 and previous config saved to 
/var/cache/conftool/dbconfig/20220525-081908-ladsgroup.json [08:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:05] Amir1: as expected, https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/799008/ failed due to unrelated parser integration tests. what should I do? I could leave it and just backport wmf.13 since that will be in group2 tomorrow [08:20:32] if I want to do wmf.12, do I need to set verified +2 and CR +2 for it to force merge? (I haven't done that before) [08:20:49] hmm, it's sorta weird as backports of mine to wmf.12 all pass but it's probably a dependency [08:21:05] if it's a user facing issue and sorta can't wait, it's fine to force merge it [08:22:20] (03CR) 10Kosta Harlan: [V: 03+2 C: 03+2] "parser test errors are unrelated" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/799008 (https://phabricator.wikimedia.org/T309152) (owner: 10MewOphaswongse) [08:22:46] Amir1: so, I press "submit"? [08:22:55] kostajh: ja [08:23:13] thx [08:25:13] Amir1: Is it OK to deploy fix for https://phabricator.wikimedia.org/T309151 now? We still have sometime for train run. [08:25:26] Need wmf.12 and wmf.13 backports. [08:25:45] kart_: I don't know if the train for this is going in EU time or US time [08:25:51] jnuche: maybe you know? [08:26:01] if it's going in US time, sure then [08:26:16] amir1: yeah, today it's US time [08:26:31] awesome. MAYHEM [08:26:38] oh noes [08:26:55] kart_: you go first, I think I have around five or six patches to deploy :/ [08:27:16] :D I still need to backport though. CI Mayhem. 
[08:27:19] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host schema2003.codfw.wmnet [08:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:41] PROBLEM - Check systemd state on schema2003 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:28:46] kart_: you +2, I deploy some config changes in the mean time, I also have backports [08:28:46] (03PS3) 10Majavah: nrpe: manage sudo rules via nrpe::check [puppet] - 10https://gerrit.wikimedia.org/r/797422 [08:29:09] (03CR) 10Ladsgroup: [C: 03+2] Set templatelinks migration to read new everywhere except three wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799265 (https://phabricator.wikimedia.org/T306673) (owner: 10Ladsgroup) [08:29:26] syncing the wmf.12 patch now [08:29:29] (03CR) 10Majavah: nrpe: manage sudo rules via nrpe::check (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/797422 (owner: 10Majavah) [08:29:30] then doing wmf.13 [08:29:45] !log kharlan@deploy1002 Synchronized php-1.39.0-wmf.12/extensions/GrowthExperiments/modules/ext.growthExperiments.StructuredTask/addimage: Backport: [[gerrit:799008|Add an image: Attach view image details button to .mw-ge-recommendedImage-imageWrapper (T309152)]] (duration: 00m 51s) [08:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:51] T309152: [regression-wmf.13] Add image: Missing "View image details" link - https://phabricator.wikimedia.org/T309152 [08:30:01] (03Merged) 10jenkins-bot: Set templatelinks migration to read new everywhere except three wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799265 (https://phabricator.wikimedia.org/T306673) (owner: 10Ladsgroup) [08:30:17] marostegui: cc, I'm setting all wikis to **read** new for templatelinks except three wikis: enwiki, commonswiki, zhwiki [08:30:30] well, still waiting on 
wmf.13 jenkins to finish [08:30:44] (03PS1) 10KartikMistry: EditSummariesAid: Check if title exists before further processing [extensions/Translate] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/798816 (https://phabricator.wikimedia.org/T309151) [08:30:45] Amir1: oki [08:31:05] kostajh: In the mean time, I sneakily deploy small stuff [08:31:20] yeah it seems like there is still some time to go :\ [08:31:25] (03CR) 10Abijeet Patro: [C: 03+1] EditSummariesAid: Check if title exists before further processing [extensions/Translate] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/798816 (https://phabricator.wikimedia.org/T309151) (owner: 10KartikMistry) [08:31:27] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35534/console" [puppet] - 10https://gerrit.wikimedia.org/r/797422 (owner: 10Majavah) [08:32:01] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:799265|Set templatelinks migration to read new everywhere except three wikis (T306673)]] (duration: 00m 50s) [08:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:05] Amir1: go ahead. I'll deploy on wmf.13 once CI is done. 
[08:32:07] T306673: Turn on read new for templatelinks on beta and production - https://phabricator.wikimedia.org/T306673 [08:32:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:24] (03PS21) 10Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793873 [08:32:31] (03CR) 10Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793873 (owner: 10Ladsgroup) [08:32:33] (03PS1) 10Majavah: hieradata: purge stale sudoers.d entries [puppet] - 10https://gerrit.wikimedia.org/r/799268 [08:32:35] (03CR) 10Ladsgroup: [C: 03+2] Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793873 (owner: 10Ladsgroup) [08:32:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:33:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:31] (03Merged) 10jenkins-bot: Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793873 (owner: 10Ladsgroup) [08:33:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host schema2004.codfw.wmnet [08:33:37] kostajh: generally it would be great if you could take out extension config out of IS.php. 
See https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/793873 as an example [08:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:33:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:46] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35535/console" [puppet] - 10https://gerrit.wikimedia.org/r/799268 (owner: 10Majavah) [08:34:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T303603)', diff saved to https://phabricator.wikimedia.org/P28482 and previous config saved to /var/cache/conftool/dbconfig/20220525-083413-ladsgroup.json [08:34:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [08:34:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [08:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T303603)', diff saved to https://phabricator.wikimedia.org/P28483 and previous config saved to /var/cache/conftool/dbconfig/20220525-083421-ladsgroup.json [08:34:21] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [08:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:35] RECOVERY - Check systemd state on schema2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:35:31] !log 
jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema2004.codfw.wmnet [08:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:46] !log ladsgroup@deploy1002 Synchronized wmf-config/ext-ORES.php: Config: [[gerrit:793873|Move out ORES extension configuration out of InitialiseSettings.php]], Part 1/3 (duration: 00m 50s) [08:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:44] !log ladsgroup@deploy1002 Synchronized multiversion/MWConfigCacheGenerator.php: Config: [[gerrit:793873|Move out ORES extension configuration out of InitialiseSettings.php]], Part 2/3 (duration: 00m 50s) [08:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 1%: After migrating to 10.6.8', diff saved to https://phabricator.wikimedia.org/P28484 and previous config saved to /var/cache/conftool/dbconfig/20220525-083659-root.json [08:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host schema1003.eqiad.wmnet [08:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:42] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:793873|Move out ORES extension configuration out of InitialiseSettings.php]], Part 3/3 (duration: 00m 49s) [08:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:27] I'm done with config deploys [08:38:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:23] (03PS2) 10Jcrespo: mariadb::misc: Fix motd that was marking misc hosts as core [puppet] - 10https://gerrit.wikimedia.org/r/798467 [08:39:44] !log 
mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:39:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:18] (03CR) 10Jcrespo: [C: 03+2] mariadb::misc: Fix motd that was marking misc hosts as core [puppet] - 10https://gerrit.wikimedia.org/r/798467 (owner: 10Jcrespo) [08:40:39] (03PS1) 10Zabe: Acquire fresh actor id [extensions/CheckUser] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/798817 (https://phabricator.wikimedia.org/T233004) [08:40:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:40:42] Amir1: is there a task tracking that? What is the motivation, just to have smaller files instead of one huge one? [08:40:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema1003.eqiad.wmnet [08:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:09] (03PS1) 10Zabe: Acquire fresh actor id [extensions/CheckUser] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/798818 (https://phabricator.wikimedia.org/T233004) [08:41:19] kostajh: https://phabricator.wikimedia.org/T308932 [08:42:13] (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:42:24] ty [08:42:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host schema1004.eqiad.wmnet [08:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:43] Amir1: CI still says 5 min for and probably 15 more minute after I +2 :) 
[08:43:34] yup [08:43:42] (03CR) 10Volans: [C: 03+1] "Although this means that in most cases we'll do 2 queries for the new name that will fail and then the old one that will succeed, it seems" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/799011 (https://phabricator.wikimedia.org/T304936) (owner: 10Cathal Mooney) [08:43:44] just +2 it and save time [08:43:45] :D [08:44:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema1004.eqiad.wmnet [08:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:40] (03PS1) 10Zabe: Fix phan failure PhanPluginSimplifyExpressionBool [extensions/CheckUser] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/798819 [08:45:11] (03PS2) 10Zabe: Acquire fresh actor id [extensions/CheckUser] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/798818 (https://phabricator.wikimedia.org/T233004) [08:46:19] Amir1: :/ [08:48:55] (03Merged) 10jenkins-bot: Add an image: Attach view image details button to .mw-ge-recommendedImage-imageWrapper [extensions/GrowthExperiments] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/799007 (https://phabricator.wikimedia.org/T309152) (owner: 10MewOphaswongse) [08:50:25] (03CR) 10KartikMistry: [C: 03+2] "Emergency wmf.13 backport deploy." [extensions/Translate] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/798816 (https://phabricator.wikimedia.org/T309151) (owner: 10KartikMistry) [08:50:37] Amir1: CI Passed, did +2. 
[08:51:03] syncing the wmf.13 patch now [08:51:42] !log kharlan@deploy1002 Synchronized php-1.39.0-wmf.13/extensions/GrowthExperiments/modules/ext.growthExperiments.StructuredTask/addimage: Backport: [[gerrit:799007|Add an image: Attach view image details button to .mw-ge-recommendedImage-imageWrapper (T309152)]] (duration: 00m 49s) [08:51:45] (03CR) 10Muehlenhoff: [C: 03+2] Allow new idp-test hosts in Ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/798709 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [08:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:49] T309152: [regression-wmf.13] Add image: Missing "View image details" link - https://phabricator.wikimedia.org/T309152 [08:51:53] ok, I'm done. Sorry that took so long [08:52:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 5%: After migrating to 10.6.8', diff saved to https://phabricator.wikimedia.org/P28485 and previous config saved to /var/cache/conftool/dbconfig/20220525-085203-root.json [08:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:32] (03PS1) 10Ladsgroup: Move CirrusSearch settings from IS.php to ext-CirrusSearch.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799272 (https://phabricator.wikimedia.org/T308932) [08:55:41] (03CR) 10CI reject: [V: 04-1] Move CirrusSearch settings from IS.php to ext-CirrusSearch.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799272 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [08:55:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:56:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:56:47] (03PS2) 10Ladsgroup: Move 
CirrusSearch settings from IS.php to ext-CirrusSearch.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799272 (https://phabricator.wikimedia.org/T308932) [08:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T303603)', diff saved to https://phabricator.wikimedia.org/P28486 and previous config saved to /var/cache/conftool/dbconfig/20220525-090028-ladsgroup.json [09:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:33] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [09:01:36] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [09:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:33] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:06:35] (03PS1) 10Filippo Giunchedi: prometheus: clarify and document 'timeout' service::catalog probe option [puppet] - 10https://gerrit.wikimedia.org/r/799276 (https://phabricator.wikimedia.org/T309107) [09:06:37] (03PS1) 10Filippo Giunchedi: Revert "hieradata: temp disable paging for thumbor" [puppet] - 10https://gerrit.wikimedia.org/r/799277 (https://phabricator.wikimedia.org/T309107) [09:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 10%: After migrating to 10.6.8', diff saved to https://phabricator.wikimedia.org/P28487 and previous config saved to /var/cache/conftool/dbconfig/20220525-090707-root.json [09:07:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:01] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host eventlog1003.eqiad.wmnet [09:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:39] Amir1: few more minutes for me.. [09:09:57] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [09:10:00] ok I've sorted the probes timeout thing from yesterday, now with https://gerrit.wikimedia.org/r/c/operations/puppet/+/799276 things will DTRT [09:10:11] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host eventlog1003.eqiad.wmnet [09:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:29] kart_: don't worry, the thing I'm fixing has been broken for weeks, can stay broken for a while longer ;) [09:10:39] :D [09:11:16] Patch is about to merge, will deploy once it is merged. [09:12:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host dborch1001.wikimedia.org [09:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:52] (03Merged) 10jenkins-bot: EditSummariesAid: Check if title exists before further processing [extensions/Translate] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/798816 (https://phabricator.wikimedia.org/T309151) (owner: 10KartikMistry) [09:15:25] !log deployed new firewall rule on all sites [09:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P28488 and previous config saved to /var/cache/conftool/dbconfig/20220525-091533-ladsgroup.json [09:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:11] OK. 
Patch passed +2 CI. abijeet will ping you once we've patch on mwdebug1001. [09:16:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dborch1001.wikimedia.org [09:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:02] 10SRE, 10MediaWiki-Uploading, 10Structured Data Engineering, 10Structured-Data-Backlog, and 3 others: LocalFile::prerenderThumbnail should have a page limit - https://phabricator.wikimedia.org/T309114 (10Joe) p:05Triage→03High I would say this probably should be backported, given this is a fix for an o... [09:17:19] OK [09:17:48] abijeet: You can now test it on mwdebug1001. [09:18:06] OK [09:18:20] (03CR) 10Giuseppe Lavagetto: [C: 03+1] lvs: stop double-checking docker registry from Icinga [puppet] - 10https://gerrit.wikimedia.org/r/793815 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [09:20:06] kart_, looks good. [09:20:25] abijeet: awesome. Deploying.. [09:20:38] 10SRE, 10Wikimedia-Incident: text-https:443 has failed probes (retrospective task) - https://phabricator.wikimedia.org/T309178 (10jbond) [09:21:00] (03PS1) 10Slyngshede: WIP: Run multiple apt repos on a single host. 
[puppet] - 10https://gerrit.wikimedia.org/r/799279 [09:21:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1130', diff saved to https://phabricator.wikimedia.org/P28489 and previous config saved to /var/cache/conftool/dbconfig/20220525-092101-marostegui.json [09:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:31] !log kartik@deploy1002 Synchronized php-1.39.0-wmf.13/extensions/Translate/src/TranslatorInterface/Aid/EditSummariesAid.php: Backport: [[gerrit:798816|EditSummariesAid: Check if title exists before further processing (T309151)]] (duration: 00m 49s) [09:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:36] T309151: MediaWiki\Revision\RevisionAccessException: Could not determine title for page ID {page_id} and revision ID {rev_id} - https://phabricator.wikimedia.org/T309151 [09:21:40] abijeet: done. [09:21:44] thanks [09:21:51] Amir1: and, I'm done :) [09:22:09] awesome [09:22:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 25%: After migrating to 10.6.8', diff saved to https://phabricator.wikimedia.org/P28490 and previous config saved to /var/cache/conftool/dbconfig/20220525-092211-root.json [09:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:25] (03CR) 10Ladsgroup: [C: 03+2] logging: Add index hint when asking for a specific user [core] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/798814 (https://phabricator.wikimedia.org/T303089) (owner: 10Ladsgroup) [09:22:27] (03CR) 10Ladsgroup: [C: 03+2] logging: Add index hint when asking for a specific user [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/798815 (https://phabricator.wikimedia.org/T303089) (owner: 10Ladsgroup) [09:22:54] (03Abandoned) 10Slyngshede: WIP: Run a public and private repo on a single host. 
[puppet] - 10https://gerrit.wikimedia.org/r/799264 (owner: 10Slyngshede) [09:22:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:08] (03CR) 10Btullis: Disable cleanup on unused Fairscheduler for Hadoop. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799257 (owner: 10Slyngshede) [09:27:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:27:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 1%: After reboot', diff saved to https://phabricator.wikimedia.org/P28491 and previous config saved to /var/cache/conftool/dbconfig/20220525-092837-root.json [09:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test1002.wikimedia.org [09:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1118.eqiad.wmnet with reason: Maintenance [09:29:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1118.eqiad.wmnet with reason: Maintenance [09:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T298555)', diff saved to 
https://phabricator.wikimedia.org/P28492 and previous config saved to /var/cache/conftool/dbconfig/20220525-092947-ladsgroup.json [09:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:54] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [09:30:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P28493 and previous config saved to /var/cache/conftool/dbconfig/20220525-093038-ladsgroup.json [09:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:41] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:37:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 50%: After migrating to 10.6.8', diff saved to https://phabricator.wikimedia.org/P28494 and previous config saved to /var/cache/conftool/dbconfig/20220525-093715-root.json [09:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:31] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host idp-test1002.wikimedia.org [09:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:35] (03CR) 10Filippo Giunchedi: [C: 03+2] lvs: stop double-checking docker registry from Icinga [puppet] - 10https://gerrit.wikimedia.org/r/793815 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [09:37:51] PROBLEM - Check systemd state on idp-test1002 is CRITICAL: CRITICAL - degraded: The following units failed: memcached.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:39:05] 
!log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test2002.wikimedia.org [09:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:30] (03Merged) 10jenkins-bot: logging: Add index hint when asking for a specific user [core] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/798814 (https://phabricator.wikimedia.org/T303089) (owner: 10Ladsgroup) [09:43:07] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:43:16] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.13/includes/logging/LogPager.php: Backport: [[gerrit:798814|logging: Add index hint when asking for a specific user (T303089)]] (duration: 00m 49s) [09:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:20] T303089: Consistent fatal timeout when visiting Special:Log default view for some users - https://phabricator.wikimedia.org/T303089 [09:43:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 5%: After reboot', diff saved to https://phabricator.wikimedia.org/P28495 and previous config saved to /var/cache/conftool/dbconfig/20220525-094340-root.json [09:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T303603)', diff saved to https://phabricator.wikimedia.org/P28496 and previous config saved to /var/cache/conftool/dbconfig/20220525-094543-ladsgroup.json [09:45:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [09:45:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [09:45:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:48] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [09:45:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T303603)', diff saved to https://phabricator.wikimedia.org/P28497 and previous config saved to /var/cache/conftool/dbconfig/20220525-094551-ladsgroup.json [09:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:52] (03CR) 10CI reject: [V: 04-1] logging: Add index hint when asking for a specific user [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/798815 (https://phabricator.wikimedia.org/T303089) (owner: 10Ladsgroup) [09:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:55] (03CR) 10Ladsgroup: [C: 03+2] "sigh" [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/798815 (https://phabricator.wikimedia.org/T303089) (owner: 10Ladsgroup) [09:46:56] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host idp-test2002.wikimedia.org [09:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:17] PROBLEM - Check systemd state on idp-test2002 is CRITICAL: CRITICAL - degraded: The following units failed: memcached.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:47:50] (03PS2) 10Jelto: gitlab: switch backup location to /srv, don't use /etc [puppet] - 10https://gerrit.wikimedia.org/r/799016 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [09:47:52] (03PS1) 10Jelto: gitlab: rsync config and data backup to same folder on replica [puppet] - 10https://gerrit.wikimedia.org/r/799280 (https://phabricator.wikimedia.org/T274463) [09:48:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:48:06] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T303603)', diff saved to https://phabricator.wikimedia.org/P28498 and previous config saved to /var/cache/conftool/dbconfig/20220525-094841-ladsgroup.json [09:48:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:48:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:33] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35536/console" [puppet] - 10https://gerrit.wikimedia.org/r/799276 (https://phabricator.wikimedia.org/T309107) (owner: 10Filippo Giunchedi) [09:51:19] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/image-suggestion: apply [09:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:35] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply [09:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 75%: After migrating to 10.6.8', diff saved to https://phabricator.wikimedia.org/P28499 and previous config saved to /var/cache/conftool/dbconfig/20220525-095219-root.json [09:52:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:57] (03CR) 10Jelto: [C: 03+1] "lgtm. 
However I want to point out that this will only affect the replica." [puppet] - 10https://gerrit.wikimedia.org/r/799016 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [09:56:00] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35537/console" [puppet] - 10https://gerrit.wikimedia.org/r/799280 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [09:57:21] jouncebot: nowandnext [09:57:21] For the next 0 hour(s) and 2 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220525T0800) [09:57:22] In 3 hour(s) and 2 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220525T1300) [09:58:12] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/image-suggestion: apply [09:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 10%: After reboot', diff saved to https://phabricator.wikimedia.org/P28500 and previous config saved to /var/cache/conftool/dbconfig/20220525-095844-root.json [09:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:15] (03CR) 10Muehlenhoff: "The patch by itself is correct (and I'll merge it as-is), but for the IDPs will still need another followup: TLS was only enabled after th" [puppet] - 10https://gerrit.wikimedia.org/r/793744 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [10:00:04] (03CR) 10Hnowlan: [C: 03+2] Allow `LOGIN` for image_suggestions Cassandra user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/798977 (owner: 10Eevans) [10:00:48] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate the IDPs to Bullseye - https://phabricator.wikimedia.org/T308214 (10MoritzMuehlenhoff) The IDPs needs a TLS-enabled build of 
the bullseye version of memcached, which was only enabled after the bullseye release (in 1.6.12). I'll create a separat... [10:01:35] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:01:49] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:02:48] (03Merged) 10jenkins-bot: logging: Add index hint when asking for a specific user [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/798815 (https://phabricator.wikimedia.org/T303089) (owner: 10Ladsgroup) [10:03:43] (03CR) 10Giuseppe Lavagetto: [C: 03+1] icinga: deprecate service::monitor class [puppet] - 10https://gerrit.wikimedia.org/r/793816 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:03:43] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 25 Jun 2022 07:55:09 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:03:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P28501 and previous config saved to /var/cache/conftool/dbconfig/20220525-100347-ladsgroup.json [10:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:01] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48107 bytes in 0.763 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:07:22] 10SRE, 10Wikimedia-Incident: text-https:443 has failed probes (retrospective task) - https://phabricator.wikimedia.org/T309178 (10jbond) [10:07:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 100%: After migrating to 10.6.8', diff saved to https://phabricator.wikimedia.org/P28502 and previous config saved to /var/cache/conftool/dbconfig/20220525-100723-root.json [10:07:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:31] 10SRE, 10Wikimedia-Incident: text-https:443 has failed probes (retrospective task) - https://phabricator.wikimedia.org/T309178 (10jbond) p:05Triage→03Medium [10:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:04] (03PS1) 10Hnowlan: cassandra-http-gateway: add missing log level [deployment-charts] - 10https://gerrit.wikimedia.org/r/799283 (https://phabricator.wikimedia.org/T304891) [10:08:06] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.12/includes/logging/LogPager.php: Backport: [[gerrit:798815|logging: Add index hint when asking for a specific user (T303089)]] (duration: 00m 52s) [10:08:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:08:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: 
apply [10:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:11] T303089: Consistent fatal timeout when visiting Special:Log default view for some users - https://phabricator.wikimedia.org/T303089 [10:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:19] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/image-suggestion: apply [10:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:43] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [10:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:53] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/image-suggestion: apply [10:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:14] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/image-suggestion: apply [10:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:25] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/image-suggestion: apply [10:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:44] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/image-suggestion: apply [10:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:15] 10SRE, 10SRE-OnFire, 10observability, 10Sustainability (Incident Followup), 10Wikimedia-Incident: create a sampled log of POST data - https://phabricator.wikimedia.org/T309186 (10jbond) [10:11:21] 10SRE, 10SRE-OnFire, 10observability, 10Sustainability (Incident Followup), 10Wikimedia-Incident: create a 
sampled log of POST data - https://phabricator.wikimedia.org/T309186 (10jbond) [10:12:20] (03PS6) 10Hnowlan: service: add image-suggestion ingress service [puppet] - 10https://gerrit.wikimedia.org/r/788753 (https://phabricator.wikimedia.org/T304891) [10:12:40] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:13] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [10:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:26] (03PS7) 10Hnowlan: service: add image-suggestion ingress service [puppet] - 10https://gerrit.wikimedia.org/r/788753 (https://phabricator.wikimedia.org/T304891) [10:13:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: After reboot', diff saved to https://phabricator.wikimedia.org/P28503 and previous config saved to /var/cache/conftool/dbconfig/20220525-101348-root.json [10:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:33] (03CR) 10Muehlenhoff: [C: 03+2] Only add component/memcached16 on Buster [puppet] - 10https://gerrit.wikimedia.org/r/793744 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [10:18:35] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P28504 and previous config saved to /var/cache/conftool/dbconfig/20220525-101852-ladsgroup.json [10:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:08] (03PS1) 10Filippo Giunchedi: Fix problems found by github.com/cloudflare/pint [alerts] - 10https://gerrit.wikimedia.org/r/799285 (https://phabricator.wikimedia.org/T309182) [10:19:10] (03CR) 10Giuseppe Lavagetto: [C: 
04-1] "If we remove monitoring, we should also remove:" [puppet] - 10https://gerrit.wikimedia.org/r/793817 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:21:52] (03PS1) 10Muehlenhoff: Add repository component for TLS-enabled memcached [puppet] - 10https://gerrit.wikimedia.org/r/799286 (https://phabricator.wikimedia.org/T308214) [10:22:19] (03CR) 10Kosta Harlan: GrowthExperiments: Remove GEHomepageSuggestedEditsTopicsRequiresOptIn (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791303 (https://phabricator.wikimedia.org/T308209) (owner: 10Kosta Harlan) [10:24:02] 10SRE, 10Wikimedia-Incident: text-https:443 has failed probes (retrospective task) - https://phabricator.wikimedia.org/T309178 (10jbond) Incident report is here https://wikitech.wikimedia.org/wiki/Incidents/2022-05-25_de.wikipedia.org please review and update with any further information thanks [10:24:27] (03CR) 10Hnowlan: [C: 03+2] service: add image-suggestion ingress service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/788753 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [10:28:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 50%: After reboot', diff saved to https://phabricator.wikimedia.org/P28506 and previous config saved to /var/cache/conftool/dbconfig/20220525-102852-root.json [10:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:49] (03CR) 10Muehlenhoff: [C: 03+2] Add repository component for TLS-enabled memcached [puppet] - 10https://gerrit.wikimedia.org/r/799286 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [10:33:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T303603)', diff saved to https://phabricator.wikimedia.org/P28507 and previous config saved to /var/cache/conftool/dbconfig/20220525-103357-ladsgroup.json [10:33:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 
6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [10:34:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [10:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:04] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [10:34:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T303603)', diff saved to https://phabricator.wikimedia.org/P28509 and previous config saved to /var/cache/conftool/dbconfig/20220525-103405-ladsgroup.json [10:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:29] !log installing libxml2 security updates [10:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:58] Amir1: is it possible to merge no-op patches like https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/791303 or do those need to wait for a backport window? [10:42:46] kostajh: generally I'm okay with deploying changes out of backport window time as long as 1- you're not stepping on someone else's toes 2- It's not in non-deploy day like Friday 3- you know what you're doing (how to test, how to make sure it's breaking anything etc.) 
[10:43:26] the third implies you can self-serve :D [10:43:39] jouncebot: nowandnext [10:43:40] No deployments scheduled for the next 2 hour(s) and 16 minute(s) [10:43:40] In 2 hour(s) and 16 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220525T1300) [10:43:50] the window is free, you can do it if you want to [10:43:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 75%: After reboot', diff saved to https://phabricator.wikimedia.org/P28510 and previous config saved to /var/cache/conftool/dbconfig/20220525-104356-root.json [10:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:44] Amir1: hmm, ok perhaps I'll do it now then [10:46:00] eh, or not, as I have a meeting starting in a few minutes [10:46:19] last question – would those patches need to be registered in a deployment calendar somewhere? [10:46:33] !log restarting FPM on mediawiki canaries to pick up libxml updates [10:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:05] (03PS1) 10Lucas Werkmeister (WMDE): query_service: don’t cache index files [puppet] - 10https://gerrit.wikimedia.org/r/799297 (https://phabricator.wikimedia.org/T289243) [10:50:28] (03PS1) 10Volans: devices: remove unused metadata [homer/public] - 10https://gerrit.wikimedia.org/r/799298 [10:54:05] (03CR) 10Lucas Werkmeister (WMDE): "I’m not very familiar with the combination of  + , nor with HTTP caching, so I’d definitely appreciate some review here " [puppet] - 10https://gerrit.wikimedia.org/r/799297 (https://phabricator.wikimedia.org/T289243) (owner: 10Lucas Werkmeister (WMDE)) [10:58:17] (03PS1) 10Jbond: rake: spdx fix variable name path [puppet] - 10https://gerrit.wikimedia.org/r/799299 [10:58:28] (03CR) 10Jbond: [V: 03+2 C: 03+2] rake: spdx fix variable name path [puppet] - 10https://gerrit.wikimedia.org/r/799299 (owner: 10Jbond) [10:59:00] !log marostegui@cumin1001 dbctl 
commit (dc=all): 'db1130 (re)pooling @ 100%: After reboot', diff saved to https://phabricator.wikimedia.org/P28511 and previous config saved to /var/cache/conftool/dbconfig/20220525-105900-root.json [10:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:11] (03CR) 10David Caro: [C: 03+2] cloudvirt: redirect prometheus script errors to journal [puppet] - 10https://gerrit.wikimedia.org/r/790387 (owner: 10David Caro) [11:08:23] (03CR) 10David Caro: [C: 03+2] wmcs-k8s-node-upgrade: add some extra logs [puppet] - 10https://gerrit.wikimedia.org/r/791348 (owner: 10David Caro) [11:08:34] (03CR) 10David Caro: [C: 03+2] wmcs-k8s-node-upgrade: black and isort [puppet] - 10https://gerrit.wikimedia.org/r/792112 (owner: 10David Caro) [11:09:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T303603)', diff saved to https://phabricator.wikimedia.org/P28512 and previous config saved to /var/cache/conftool/dbconfig/20220525-110859-ladsgroup.json [11:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:07] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [11:09:45] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [11:12:10] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [11:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:53] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [11:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:17] (03CR) 10Volans: [C: 04-1] "LGTM, two typos, see inline. Consider this a +1 once fixed." 
[cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/786960 (https://phabricator.wikimedia.org/T304533) (owner: 10David Caro) [11:24:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P28513 and previous config saved to /var/cache/conftool/dbconfig/20220525-112404-ladsgroup.json [11:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:32] (03PS1) 10Muehlenhoff: mailman3: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799305 (https://phabricator.wikimedia.org/T308013) [11:31:34] (03PS1) 10Muehlenhoff: testreduce: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799306 (https://phabricator.wikimedia.org/T308013) [11:31:36] (03PS1) 10Muehlenhoff: gdnsd: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799307 (https://phabricator.wikimedia.org/T308013) [11:31:40] (03PS1) 10Muehlenhoff: dragonfly: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799308 (https://phabricator.wikimedia.org/T308013) [11:31:42] (03PS1) 10Muehlenhoff: alerts: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799309 (https://phabricator.wikimedia.org/T308013) [11:31:44] (03PS1) 10Muehlenhoff: opensearch: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799310 (https://phabricator.wikimedia.org/T308013) [11:31:46] (03PS1) 10Muehlenhoff: envoyproxy: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799311 (https://phabricator.wikimedia.org/T308013) [11:31:48] (03PS1) 10Muehlenhoff: motd/kmod/debconf: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799312 (https://phabricator.wikimedia.org/T308013) [11:31:50] (03PS1) 10Muehlenhoff: netconsole/systemtap/haveged: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799313 (https://phabricator.wikimedia.org/T308013) [11:32:13] (03CR) 10CI reject: [V: 04-1] mailman3: Add SPDX headers [puppet] - 
10https://gerrit.wikimedia.org/r/799305 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:32:16] (03PS1) 10Jbond: WIP/Do Not Merge: demonstrate some debugging steps I use [puppet] - 10https://gerrit.wikimedia.org/r/799314 [11:38:50] (03CR) 10Jbond: "Adding morits." [puppet] - 10https://gerrit.wikimedia.org/r/799268 (owner: 10Majavah) [11:39:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P28514 and previous config saved to /var/cache/conftool/dbconfig/20220525-113909-ladsgroup.json [11:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:48] (03CR) 10Jbond: [C: 03+1] Change order that Netbox server provision script gets old/new vlan name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/799011 (https://phabricator.wikimedia.org/T304936) (owner: 10Cathal Mooney) [11:39:50] (03CR) 10CI reject: [V: 04-1] WIP/Do Not Merge: demonstrate some debugging steps I use [puppet] - 10https://gerrit.wikimedia.org/r/799314 (owner: 10Jbond) [11:40:06] !log restarting clamav on otrs1001 for libxml update [11:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:39] RECOVERY - Check systemd state on mw1320 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:31] (03CR) 10Jbond: [C: 03+1] prometheus: clarify and document 'timeout' service::catalog probe option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799276 (https://phabricator.wikimedia.org/T309107) (owner: 10Filippo Giunchedi) [11:46:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T298555)', diff saved to https://phabricator.wikimedia.org/P28515 and previous config saved to /var/cache/conftool/dbconfig/20220525-114640-ladsgroup.json [11:46:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:45] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [11:47:22] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/799306 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:47:43] (03CR) 10Jbond: [C: 03+1] alerts: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799309 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:48:07] (03CR) 10Jbond: [C: 03+1] opensearch: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799310 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:48:36] (03PS1) 10David Caro: build: flag the module as typed [software/spicerack] - 10https://gerrit.wikimedia.org/r/799316 [11:48:48] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/799311 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:49:16] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/799313 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:50:10] (03CR) 10Jbond: nrpe: manage sudo rules via nrpe::check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/797422 (owner: 10Majavah) [11:51:38] (03CR) 10Slyngshede: [C: 03+2] WIP: Run multiple apt repos on a single host. 
[puppet] - 10https://gerrit.wikimedia.org/r/799279 (owner: 10Slyngshede) [11:51:41] (03CR) 10Jbond: [C: 03+2] "LKGTm will merge thanks" [puppet] - 10https://gerrit.wikimedia.org/r/797422 (owner: 10Majavah) [11:52:21] RECOVERY - Check systemd state on mw2351 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:54:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T303603)', diff saved to https://phabricator.wikimedia.org/P28516 and previous config saved to /var/cache/conftool/dbconfig/20220525-115414-ladsgroup.json [11:54:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [11:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [11:54:21] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [11:54:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [11:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [11:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [11:54:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [11:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[11:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:41] (03PS5) 10David Caro: Move from deprecated icinga_hosts to alerting_hosts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/786960 (https://phabricator.wikimedia.org/T304533) [11:56:43] (03CR) 10David Caro: Move from deprecated icinga_hosts to alerting_hosts (032 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/786960 (https://phabricator.wikimedia.org/T304533) (owner: 10David Caro) [11:59:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [11:59:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [11:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T303603)', diff saved to https://phabricator.wikimedia.org/P28517 and previous config saved to /var/cache/conftool/dbconfig/20220525-115932-ladsgroup.json [11:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:41] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [12:00:07] (03PS1) 10Majavah: labstore: update monitoring for nrpe changes [puppet] - 10https://gerrit.wikimedia.org/r/799318 [12:01:07] (03PS2) 10Majavah: labstore: update monitoring for nrpe changes [puppet] - 10https://gerrit.wikimedia.org/r/799318 [12:01:33] (03CR) 10David Caro: [C: 03+2] Move from deprecated icinga_hosts to alerting_hosts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/786960 (https://phabricator.wikimedia.org/T304533) (owner: 10David Caro) [12:01:45] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P28518 and previous config saved to /var/cache/conftool/dbconfig/20220525-120145-ladsgroup.json [12:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:18] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35539/console" [puppet] - 10https://gerrit.wikimedia.org/r/799318 (owner: 10Majavah) [12:02:47] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:03:13] (KubernetesRsyslogDown) firing: (8) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:03:23] (03PS1) 10Jbond: Revert "nrpe: manage sudo rules via nrpe::check" [puppet] - 10https://gerrit.wikimedia.org/r/798823 [12:03:58] (03CR) 10Jbond: [C: 03+2] Revert "nrpe: manage sudo rules via nrpe::check" [puppet] - 10https://gerrit.wikimedia.org/r/798823 (owner: 10Jbond) [12:04:10] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "nrpe: manage sudo rules via nrpe::check" [puppet] - 10https://gerrit.wikimedia.org/r/798823 (owner: 10Jbond) [12:04:50] (03Merged) 10jenkins-bot: Move from deprecated icinga_hosts to alerting_hosts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/786960 (https://phabricator.wikimedia.org/T304533) (owner: 10David Caro) [12:06:05] (03CR) 10Jbond: [V: 03+2 C: 03+2] "need to dig into this further but its possible that it required two runs to work properly???" 
[puppet] - 10https://gerrit.wikimedia.org/r/798823 (owner: 10Jbond) [12:08:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T303603)', diff saved to https://phabricator.wikimedia.org/P28519 and previous config saved to /var/cache/conftool/dbconfig/20220525-120800-ladsgroup.json [12:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:08] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [12:08:56] (03PS1) 10Cathal Mooney: Changes to includes on reverse zone for 2620:0:861::/48 Eqiad [dns] - 10https://gerrit.wikimedia.org/r/799319 (https://phabricator.wikimedia.org/T304936) [12:09:48] (03CR) 10CI reject: [V: 04-1] Changes to includes on reverse zone for 2620:0:861::/48 Eqiad [dns] - 10https://gerrit.wikimedia.org/r/799319 (https://phabricator.wikimedia.org/T304936) (owner: 10Cathal Mooney) [12:16:26] (03PS1) 10Marostegui: Revert "db1127: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/798824 [12:16:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P28520 and previous config saved to /var/cache/conftool/dbconfig/20220525-121650-ladsgroup.json [12:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:09] (03CR) 10Marostegui: [C: 03+2] Revert "db1127: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/798824 (owner: 10Marostegui) [12:20:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 1%: After investigating HW issues', diff saved to https://phabricator.wikimedia.org/P28521 and previous config saved to /var/cache/conftool/dbconfig/20220525-122008-root.json [12:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Adjust weight db1127', diff saved to 
https://phabricator.wikimedia.org/P28522 and previous config saved to /var/cache/conftool/dbconfig/20220525-122040-marostegui.json [12:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:02] !log imported openjdk-8 8u332-ga-1~deb10u1 for buster-wikimedia [12:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P28523 and previous config saved to /var/cache/conftool/dbconfig/20220525-122305-ladsgroup.json [12:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:33] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:31:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T298555)', diff saved to https://phabricator.wikimedia.org/P28524 and previous config saved to /var/cache/conftool/dbconfig/20220525-123155-ladsgroup.json [12:31:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1134.eqiad.wmnet with reason: Maintenance [12:31:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1134.eqiad.wmnet with reason: Maintenance [12:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:02] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [12:32:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T298555)', diff saved to https://phabricator.wikimedia.org/P28525 and previous config saved to /var/cache/conftool/dbconfig/20220525-123203-ladsgroup.json [12:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:09] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 5%: After investigating HW issues', diff saved to https://phabricator.wikimedia.org/P28526 and previous config saved to /var/cache/conftool/dbconfig/20220525-123512-root.json [12:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:51] 10SRE, 10Wikimedia-Mailing-lists: Mailing list for Turkic Wikimedians - https://phabricator.wikimedia.org/T309155 (10Mehman97) Thanks. [12:38:04] (03CR) 10Volans: [C: 03+1] "Sure, I actually thought we had already added it :)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/799316 (owner: 10David Caro) [12:38:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P28527 and previous config saved to /var/cache/conftool/dbconfig/20220525-123811-ladsgroup.json [12:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:00] (03PS2) 10David Caro: build: flag the module as typed [software/spicerack] - 10https://gerrit.wikimedia.org/r/799316 [12:41:06] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:16] (03CR) 10David Caro: [C: 03+2] build: flag the module as typed [software/spicerack] - 10https://gerrit.wikimedia.org/r/799316 (owner: 10David Caro) [12:41:20] !log cmooney@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [12:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:32] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:13] (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes 
logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:44:41] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:30] (03PS2) 10Cathal Mooney: Changes to includes on reverse zone for 2620:0:861::/48 Eqiad [dns] - 10https://gerrit.wikimedia.org/r/799319 (https://phabricator.wikimedia.org/T304936) [12:48:03] (03PS1) 10Elukey: ml-services: update image for enwiki goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/799328 (https://phabricator.wikimedia.org/T309102) [12:49:23] (03CR) 10Cathal Mooney: [C: 03+2] Changes to includes on reverse zone for 2620:0:861::/48 Eqiad [dns] - 10https://gerrit.wikimedia.org/r/799319 (https://phabricator.wikimedia.org/T304936) (owner: 10Cathal Mooney) [12:50:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 10%: After investigating HW issues', diff saved to https://phabricator.wikimedia.org/P28528 and previous config saved to /var/cache/conftool/dbconfig/20220525-125016-root.json [12:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:03] (03Merged) 10jenkins-bot: build: flag the module as typed [software/spicerack] - 10https://gerrit.wikimedia.org/r/799316 (owner: 10David Caro) [12:53:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T303603)', diff saved to https://phabricator.wikimedia.org/P28529 and previous config saved to /var/cache/conftool/dbconfig/20220525-125316-ladsgroup.json [12:53:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [12:53:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [12:53:21] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:22] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [12:53:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T303603)', diff saved to https://phabricator.wikimedia.org/P28530 and previous config saved to /var/cache/conftool/dbconfig/20220525-125324-ladsgroup.json [12:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:24] (03PS2) 10Muehlenhoff: netconsole/systemtap/haveged: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799313 (https://phabricator.wikimedia.org/T308013) [12:58:17] (03PS1) 10Majavah: P:openstack: remove unused base db class [puppet] - 10https://gerrit.wikimedia.org/r/799330 [12:58:17] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:58:25] (03PS1) 10Majavah: P:openstack: remove labs_hosts_range from hiera [puppet] - 10https://gerrit.wikimedia.org/r/799331 [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220525T1300). [13:00:04] No Gerrit patches in the queue for this window AFAICS. 
[13:00:29] (03PS2) 10Filippo Giunchedi: prometheus: clarify and document 'timeout' service::catalog probe option [puppet] - 10https://gerrit.wikimedia.org/r/799276 (https://phabricator.wikimedia.org/T309107) [13:00:33] (03PS2) 10Filippo Giunchedi: Revert "hieradata: temp disable paging for thumbor" [puppet] - 10https://gerrit.wikimedia.org/r/799277 (https://phabricator.wikimedia.org/T309107) [13:00:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T303603)', diff saved to https://phabricator.wikimedia.org/P28531 and previous config saved to /var/cache/conftool/dbconfig/20220525-130035-ladsgroup.json [13:00:37] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: clarify and document 'timeout' service::catalog probe option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799276 (https://phabricator.wikimedia.org/T309107) (owner: 10Filippo Giunchedi) [13:00:41] (03CR) 10Muehlenhoff: [C: 03+2] netconsole/systemtap/haveged: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799313 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:42] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [13:00:55] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [13:01:03] (03PS2) 10Muehlenhoff: mailman3: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799305 (https://phabricator.wikimedia.org/T308013) [13:01:07] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: clarify and document 'timeout' service::catalog probe option [puppet] - 10https://gerrit.wikimedia.org/r/799276 (https://phabricator.wikimedia.org/T309107) (owner: 10Filippo Giunchedi) [13:01:19] (03PS3) 10Muehlenhoff: mailman3: Add SPDX headers [puppet] - 
10https://gerrit.wikimedia.org/r/799305 (https://phabricator.wikimedia.org/T308013) [13:01:27] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 6 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35541/console" [puppet] - 10https://gerrit.wikimedia.org/r/799331 (owner: 10Majavah) [13:01:41] (03PS2) 10Majavah: P:openstack: remove labs_hosts_range from hiera [puppet] - 10https://gerrit.wikimedia.org/r/799331 [13:02:19] (03CR) 10Filippo Giunchedi: [C: 03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/799276 (https://phabricator.wikimedia.org/T309107) (owner: 10Filippo Giunchedi) [13:02:46] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 6 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35542/console" [puppet] - 10https://gerrit.wikimedia.org/r/799331 (owner: 10Majavah) [13:03:17] (03PS3) 10Filippo Giunchedi: prometheus: clarify and document 'timeout' service::catalog probe option [puppet] - 10https://gerrit.wikimedia.org/r/799276 (https://phabricator.wikimedia.org/T309107) [13:03:34] (03PS2) 10David Caro: wmcs: isort and black [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/788676 [13:03:36] (03PS7) 10David Caro: wmcs: Fix types and associated code refactor [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/788351 [13:04:04] (03CR) 10Filippo Giunchedi: [V: 03+2] prometheus: clarify and document 'timeout' service::catalog probe option [puppet] - 10https://gerrit.wikimedia.org/r/799276 (https://phabricator.wikimedia.org/T309107) (owner: 10Filippo Giunchedi) [13:04:35] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS13030/IPv6: Connect - Init7, AS13030/IPv4: Connect - Init7, AS6939/IPv4: Connect - HE, AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:05:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 20%: After investigating HW issues', diff saved to https://phabricator.wikimedia.org/P28532 and 
previous config saved to /var/cache/conftool/dbconfig/20220525-130519-root.json [13:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:45] (03PS3) 10Filippo Giunchedi: Revert "hieradata: temp disable paging for thumbor" [puppet] - 10https://gerrit.wikimedia.org/r/799277 (https://phabricator.wikimedia.org/T309107) [13:08:38] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "hieradata: temp disable paging for thumbor" [puppet] - 10https://gerrit.wikimedia.org/r/799277 (https://phabricator.wikimedia.org/T309107) (owner: 10Filippo Giunchedi) [13:10:19] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: deprecate service::monitor class [puppet] - 10https://gerrit.wikimedia.org/r/793816 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [13:11:07] RECOVERY - BGP status on cr1-drmrs is OK: BGP OK - up: 21, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:12:44] (03CR) 10David Caro: [C: 03+2] wmcs: Fix types and associated code refactor [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/788351 (owner: 10David Caro) [13:12:46] (03CR) 10David Caro: [C: 03+2] wmcs: isort and black [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/788676 (owner: 10David Caro) [13:12:48] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/799309 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:13:15] RECOVERY - Memcached on idp-test2002 is OK: TCP OK - 0.033 second response time on 208.80.153.70 port 11000 https://wikitech.wikimedia.org/wiki/Memcached [13:13:43] RECOVERY - Check systemd state on idp-test2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:15:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P28533 and previous config saved to /var/cache/conftool/dbconfig/20220525-131540-ladsgroup.json [13:15:45] !log imported memcached 1.6.9+dfsg-1+wmf11u1 to bullseye-wikimedia (TLS-enabled build) T308214 [13:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:52] T308214: Migrate the IDPs to Bullseye - https://phabricator.wikimedia.org/T308214 [13:15:55] (03Merged) 10jenkins-bot: wmcs: isort and black [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/788676 (owner: 10David Caro) [13:15:57] (03Merged) 10jenkins-bot: wmcs: Fix types and associated code refactor [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/788351 (owner: 10David Caro) [13:17:23] (03CR) 10David Caro: [C: 03+2] "LGTM 👍" [puppet] - 10https://gerrit.wikimedia.org/r/799331 (owner: 10Majavah) [13:17:46] (03CR) 10David Caro: [C: 03+2] ":deletedelete:" [puppet] - 10https://gerrit.wikimedia.org/r/799330 (owner: 10Majavah) [13:19:45] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.196 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:20:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 25%: After investigating HW issues', diff saved to https://phabricator.wikimedia.org/P28534
and previous config saved to /var/cache/conftool/dbconfig/20220525-132023-root.json [13:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:50] (03CR) 10Ladsgroup: mailman3: Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799305 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:22:27] (03PS4) 10Muehlenhoff: mailman3: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799305 (https://phabricator.wikimedia.org/T308013) [13:22:39] (03CR) 10Muehlenhoff: mailman3: Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799305 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:23:26] (03PS5) 10Filippo Giunchedi: icinga: remove 'monitoring' from service::catalog [puppet] - 10https://gerrit.wikimedia.org/r/793817 (https://phabricator.wikimedia.org/T291946) [13:23:28] (03CR) 10Ladsgroup: [C: 03+1] mailman3: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799305 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:24:13] (03CR) 10Filippo Giunchedi: icinga: remove 'monitoring' from service::catalog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793817 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [13:27:11] (03PS1) 10Jelto: wikimedia.org: add gitlab-new records + PTR [dns] - 10https://gerrit.wikimedia.org/r/799334 (https://phabricator.wikimedia.org/T307142) [13:27:58] (03CR) 10CI reject: [V: 04-1] wikimedia.org: add gitlab-new records + PTR [dns] - 10https://gerrit.wikimedia.org/r/799334 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [13:29:37] (03PS2) 10Jelto: wikimedia.org: add gitlab-new records + PTR [dns] - 10https://gerrit.wikimedia.org/r/799334 (https://phabricator.wikimedia.org/T307142) [13:30:16] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! Checked running here locally and it's a no-op across the entire estate." 
[homer/public] - 10https://gerrit.wikimedia.org/r/799298 (owner: 10Volans) [13:30:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P28535 and previous config saved to /var/cache/conftool/dbconfig/20220525-133046-ladsgroup.json [13:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:42] (03CR) 10Cathal Mooney: [C: 03+2] devices: remove unused metadata [homer/public] - 10https://gerrit.wikimedia.org/r/799298 (owner: 10Volans) [13:32:10] (03PS2) 10Jbond: WIP/Do Not Merge: demonstrate some debugging steps I use [puppet] - 10https://gerrit.wikimedia.org/r/799314 [13:32:24] (03Merged) 10jenkins-bot: devices: remove unused metadata [homer/public] - 10https://gerrit.wikimedia.org/r/799298 (owner: 10Volans) [13:32:54] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@3ae51e7]: (no justification provided) [13:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:05] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@3ae51e7]: (no justification provided) (duration: 00m 11s) [13:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 40%: After investigating HW issues', diff saved to https://phabricator.wikimedia.org/P28536 and previous config saved to /var/cache/conftool/dbconfig/20220525-133527-root.json [13:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:06] (03CR) 10CI reject: [V: 04-1] WIP/Do Not Merge: demonstrate some debugging steps I use [puppet] - 10https://gerrit.wikimedia.org/r/799314 (owner: 10Jbond) [13:41:14] (03PS1) 10Slyngshede: P:aptrepo::private refactor code to allow for multiple repositories. 
[puppet] - 10https://gerrit.wikimedia.org/r/799340 [13:41:46] (03CR) 10Elukey: [C: 03+2] ml-services: update image for enwiki goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/799328 (https://phabricator.wikimedia.org/T309102) (owner: 10Elukey) [13:42:24] (03CR) 10CI reject: [V: 04-1] P:aptrepo::private refactor code to allow for multiple repositories. [puppet] - 10https://gerrit.wikimedia.org/r/799340 (owner: 10Slyngshede) [13:44:29] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [13:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T303603)', diff saved to https://phabricator.wikimedia.org/P28537 and previous config saved to /var/cache/conftool/dbconfig/20220525-134551-ladsgroup.json [13:45:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [13:45:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [13:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:57] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [13:45:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T303603)', diff saved to https://phabricator.wikimedia.org/P28538 and previous config saved to /var/cache/conftool/dbconfig/20220525-134559-ladsgroup.json [13:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:31] (03PS1) 10Cathal Mooney: Move cloudsw capirca def to roles and asn to devices 
[homer/public] - 10https://gerrit.wikimedia.org/r/799341 (https://phabricator.wikimedia.org/T304936) [13:47:01] (03CR) 10CI reject: [V: 04-1] Move cloudsw capirca def to roles and asn to devices [homer/public] - 10https://gerrit.wikimedia.org/r/799341 (https://phabricator.wikimedia.org/T304936) (owner: 10Cathal Mooney) [13:47:44] (03PS2) 10Slyngshede: P:aptrepo::private refactor code to allow for multiple repositories. [puppet] - 10https://gerrit.wikimedia.org/r/799340 [13:48:00] (03PS2) 10Cathal Mooney: Move cloudsw capirca def to roles and asn to devices [homer/public] - 10https://gerrit.wikimedia.org/r/799341 (https://phabricator.wikimedia.org/T304936) [13:48:19] (03CR) 10CI reject: [V: 04-1] P:aptrepo::private refactor code to allow for multiple repositories. [puppet] - 10https://gerrit.wikimedia.org/r/799340 (owner: 10Slyngshede) [13:50:09] (03PS1) 10Jbond: wmflib::service: add data loader class [puppet] - 10https://gerrit.wikimedia.org/r/799342 [13:50:11] (03CR) 10Cathal Mooney: [C: 03+2] Move cloudsw capirca def to roles and asn to devices [homer/public] - 10https://gerrit.wikimedia.org/r/799341 (https://phabricator.wikimedia.org/T304936) (owner: 10Cathal Mooney) [13:50:13] (03CR) 10Vgutierrez: Revert "Cache Badtitle 400s for 60s in varnish-fe" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769827 (owner: 10Legoktm) [13:50:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 50%: After investigating HW issues', diff saved to https://phabricator.wikimedia.org/P28539 and previous config saved to /var/cache/conftool/dbconfig/20220525-135031-root.json [13:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:44] (03CR) 10CI reject: [V: 04-1] wmflib::service: add data loader class [puppet] - 10https://gerrit.wikimedia.org/r/799342 (owner: 10Jbond) [13:51:01] (03Merged) 10jenkins-bot: Move cloudsw capirca def to roles and asn to devices [homer/public] - 10https://gerrit.wikimedia.org/r/799341 
(https://phabricator.wikimedia.org/T304936) (owner: 10Cathal Mooney) [13:52:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T303603)', diff saved to https://phabricator.wikimedia.org/P28540 and previous config saved to /var/cache/conftool/dbconfig/20220525-135258-ladsgroup.json [13:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:05] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [13:53:43] (03PS3) 10Slyngshede: P:aptrepo::private refactor code to allow for multiple repositories. [puppet] - 10https://gerrit.wikimedia.org/r/799340 [13:54:19] (03PS1) 10David Caro: Add missing secret for cloud acme-chief account [labs/private] - 10https://gerrit.wikimedia.org/r/799343 [13:55:45] (03CR) 10David Caro: [C: 03+2] Add missing secret for cloud acme-chief account [labs/private] - 10https://gerrit.wikimedia.org/r/799343 (owner: 10David Caro) [13:56:12] (03CR) 10David Caro: [V: 03+2 C: 03+2] Add missing secret for cloud acme-chief account [labs/private] - 10https://gerrit.wikimedia.org/r/799343 (owner: 10David Caro) [13:57:27] (03PS1) 10Majavah: nrpe: manage sudo rules via nrpe::check (try 2) [puppet] - 10https://gerrit.wikimedia.org/r/799344 [13:59:41] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 4 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35544/console" [puppet] - 10https://gerrit.wikimedia.org/r/799344 (owner: 10Majavah) [14:00:26] (03PS7) 10Jbond: Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [14:00:28] (03PS3) 10Jbond: WIP/Do Not Merge: demonstrate some debugging steps I use [puppet] - 10https://gerrit.wikimedia.org/r/799314 [14:00:47] (03PS1) 10David Caro: cloudinfra: Add missing designate password [labs/private] - 10https://gerrit.wikimedia.org/r/799345 [14:02:04] (03CR) 10CI reject: [V: 
04-1] WIP/Do Not Merge: demonstrate some debugging steps I use [puppet] - 10https://gerrit.wikimedia.org/r/799314 (owner: 10Jbond) [14:02:27] jouncebot: nowandnext [14:02:27] No deployments scheduled for the next 3 hour(s) and 57 minute(s) [14:02:27] In 3 hour(s) and 57 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220525T1800) [14:02:27] In 3 hour(s) and 57 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220525T1800) [14:02:57] (03CR) 10CI reject: [V: 04-1] Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [14:05:18] I’ll deploy some security fixes if that’s alright [14:05:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 60%: After investigating HW issues', diff saved to https://phabricator.wikimedia.org/P28541 and previous config saved to /var/cache/conftool/dbconfig/20220525-140535-root.json [14:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:19] (03PS1) 10Muehlenhoff: memcached: Untangle TLS/1.6 options [puppet] - 10https://gerrit.wikimedia.org/r/799348 (https://phabricator.wikimedia.org/T308214) [14:07:54] (03CR) 10CI reject: [V: 04-1] memcached: Untangle TLS/1.6 options [puppet] - 10https://gerrit.wikimedia.org/r/799348 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [14:08:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P28542 and previous config saved to /var/cache/conftool/dbconfig/20220525-140803-ladsgroup.json [14:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:20] (03PS2) 10Majavah: nrpe: manage sudo rules via nrpe::check (try 2) [puppet] - 10https://gerrit.wikimedia.org/r/799344 [14:08:58] (03PS2) 10Muehlenhoff: 
memcached: Untangle TLS/1.6 options [puppet] - 10https://gerrit.wikimedia.org/r/799348 (https://phabricator.wikimedia.org/T308214) [14:09:36] (03CR) 10CI reject: [V: 04-1] memcached: Untangle TLS/1.6 options [puppet] - 10https://gerrit.wikimedia.org/r/799348 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [14:09:47] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 4 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35545/console" [puppet] - 10https://gerrit.wikimedia.org/r/799344 (owner: 10Majavah) [14:10:01] (03CR) 10Majavah: "not sure if this fixes everything, but at least the compilation errors are fixed" [puppet] - 10https://gerrit.wikimedia.org/r/799344 (owner: 10Majavah) [14:10:49] (03PS3) 10Majavah: nrpe: manage sudo rules via nrpe::check (try 2) [puppet] - 10https://gerrit.wikimedia.org/r/799344 [14:11:19] (03PS1) 10Elukey: ml-services: update docker image for revscoring-editquality-* [deployment-charts] - 10https://gerrit.wikimedia.org/r/799349 (https://phabricator.wikimedia.org/T309102) [14:12:28] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 4 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35546/console" [puppet] - 10https://gerrit.wikimedia.org/r/799344 (owner: 10Majavah) [14:12:49] (03CR) 10Elukey: "Aiko: this change should also rollout your changes that you were testing in arwiki-goodfaith, lemme know if it is ok to deploy them everyw" [deployment-charts] - 10https://gerrit.wikimedia.org/r/799349 (https://phabricator.wikimedia.org/T309102) (owner: 10Elukey) [14:14:40] (03PS3) 10Muehlenhoff: memcached: Untangle TLS/1.6 options [puppet] - 10https://gerrit.wikimedia.org/r/799348 (https://phabricator.wikimedia.org/T308214) [14:18:48] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/799348 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [14:20:37] (03CR) 10Muehlenhoff: [C: 03+2] mailman3: Add SPDX headers 
[puppet] - 10https://gerrit.wikimedia.org/r/799305 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [14:20:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 75%: After investigating HW issues', diff saved to https://phabricator.wikimedia.org/P28543 and previous config saved to /var/cache/conftool/dbconfig/20220525-142039-root.json [14:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P28544 and previous config saved to /var/cache/conftool/dbconfig/20220525-142308-ladsgroup.json [14:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:51] (03Abandoned) 10Jbond: wmflib::service: add data loader class [puppet] - 10https://gerrit.wikimedia.org/r/799342 (owner: 10Jbond) [14:26:00] (03PS8) 10Jbond: Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [14:28:39] (03CR) 10CI reject: [V: 04-1] Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [14:31:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:32:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:19] (03Restored) 10Jbond: wmflib::service: add data loader class [puppet] - 10https://gerrit.wikimedia.org/r/799342 (owner: 
10Jbond) [14:33:30] (03PS2) 10Jbond: wmflib::service: add data loader class [puppet] - 10https://gerrit.wikimedia.org/r/799342 [14:33:32] (03PS9) 10Jbond: Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [14:33:38] !log Deployed patches for T308659 [14:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:49] (03Abandoned) 10Jbond: WIP/Do Not Merge: demonstrate some debugging steps I use [puppet] - 10https://gerrit.wikimedia.org/r/799314 (owner: 10Jbond) [14:34:05] (I’m done) [14:35:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 100%: After investigating HW issues', diff saved to https://phabricator.wikimedia.org/P28545 and previous config saved to /var/cache/conftool/dbconfig/20220525-143543-root.json [14:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:07] (03PS3) 10Jbond: wmflib::service: add data loader class [puppet] - 10https://gerrit.wikimedia.org/r/799342 [14:36:33] (03PS1) 10Ladsgroup: [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) [14:36:51] (03CR) 10Jbond: wmflib::service: add data loader class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799342 (owner: 10Jbond) [14:37:14] (03CR) 10CI reject: [V: 04-1] Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [14:37:41] (03CR) 10CI reject: [V: 04-1] [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799352 
(https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [14:38:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T303603)', diff saved to https://phabricator.wikimedia.org/P28546 and previous config saved to /var/cache/conftool/dbconfig/20220525-143813-ladsgroup.json [14:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:20] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [14:40:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:41:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:47] (03CR) 10David Caro: [C: 03+2] cloudinfra: Add missing designate password [labs/private] - 10https://gerrit.wikimedia.org/r/799345 (owner: 10David Caro) [14:44:49] (03CR) 10David Caro: [V: 03+2 C: 03+2] cloudinfra: Add missing designate password [labs/private] - 10https://gerrit.wikimedia.org/r/799345 (owner: 10David Caro) [14:45:42] (03CR) 10David Caro: [V: 03+1 C: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35547/console" [puppet] - 10https://gerrit.wikimedia.org/r/779032 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [14:47:18] lists.wikimedia.org is now REALLY slow for me. is it just me? 
[14:48:27] (03CR) 10David Caro: [V: 03+1 C: 03+2] "The pcc looks good, and now we can run it on this host :)" [puppet] - 10https://gerrit.wikimedia.org/r/779032 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [14:49:35] urbanecm: not just you. My browser just timed out trying to load https://lists.wikimedia.org/ [14:50:09] 10SRE, 10Data-Engineering, 10Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (10Ottomata) Moving back to incoming, this is not an Ops Week task. [14:50:53] thanks bd808. Amir1 (or others), can you check what's up with it please? [14:51:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10nskaggs) Just wondering on the status of these machines. Anything I can help with? [14:52:21] urbanecm: loads for me [14:52:36] :( [14:52:40] loading now for me as well [14:53:34] started working fine again. thanks :) [14:53:47] 10SRE, 10Gerrit: Icinga Check SSL might have a time based race condition - https://phabricator.wikimedia.org/T308908 (10dancy) Another flap happened last night. @RhinosF1 has suggested restarting apache since it still has a worker running from last month. The restart hasn't happened yet though. ` dancy@ge... [14:54:53] (03CR) 10Muehlenhoff: "Thanks for nudging this forward, Taavi!" [puppet] - 10https://gerrit.wikimedia.org/r/799268 (owner: 10Majavah) [14:55:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10nskaggs) I believe this task is also now cautiously ready to proceed with the finalization of the design in T304989. @cmooney ca... 
[14:56:58] (03CR) 10BryanDavis: [C: 03+2] helmfile.d: add developer-portal [deployment-charts] - 10https://gerrit.wikimedia.org/r/773995 (https://phabricator.wikimedia.org/T297140) (owner: 10Majavah) [14:58:09] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Mabualruz - https://phabricator.wikimedia.org/T309215 (10Jdrewniak) [14:58:43] (03CR) 10Jbond: "this looks good but lets have a chat tomorrow i think im still missing a bit of context" [puppet] - 10https://gerrit.wikimedia.org/r/799340 (owner: 10Slyngshede) [15:00:39] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:00:54] (03PS2) 10Muehlenhoff: opensearch: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799310 (https://phabricator.wikimedia.org/T308013) [15:01:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298555)', diff saved to https://phabricator.wikimedia.org/P28547 and previous config saved to /var/cache/conftool/dbconfig/20220525-150107-ladsgroup.json [15:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:14] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [15:02:44] (03Merged) 10jenkins-bot: helmfile.d: add developer-portal [deployment-charts] - 10https://gerrit.wikimedia.org/r/773995 (https://phabricator.wikimedia.org/T297140) (owner: 10Majavah) [15:04:23] 10SRE, 10SRE-OnFire, 10Observability-Logging, 10Sustainability (Incident Followup), 10Wikimedia-Incident: create a sampled log of POST data - https://phabricator.wikimedia.org/T309186 (10colewhite) [15:05:24] (03CR) 10Jbond: [C: 03+1] "LGTM minor nit" [puppet] - 10https://gerrit.wikimedia.org/r/799348 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [15:08:34] (03PS1) 10Muehlenhoff: idp::memcached: Only enable memcached_16 on Buster [puppet] - 
10https://gerrit.wikimedia.org/r/799354 (https://phabricator.wikimedia.org/T308214) [15:10:12] (03PS4) 10Muehlenhoff: memcached: Untangle TLS/1.6 options [puppet] - 10https://gerrit.wikimedia.org/r/799348 (https://phabricator.wikimedia.org/T308214) [15:10:24] (03CR) 10Muehlenhoff: memcached: Untangle TLS/1.6 options (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799348 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [15:16:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P28548 and previous config saved to /var/cache/conftool/dbconfig/20220525-151612-ladsgroup.json [15:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:45] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:20:19] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:22:37] (03CR) 10Muehlenhoff: [C: 03+2] alerts: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799309 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:22:43] (03PS2) 10Muehlenhoff: alerts: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799309 (https://phabricator.wikimedia.org/T308013) [15:22:56] (03PS1) 10Giuseppe Lavagetto: deployment_server: actually pass the deployment_group to scap [puppet] - 10https://gerrit.wikimedia.org/r/799356 [15:23:45] (03CR) 10Ahmon Dancy: [C: 03+1] deployment_server: actually pass the deployment_group to scap [puppet] - 10https://gerrit.wikimedia.org/r/799356 (owner: 10Giuseppe Lavagetto) [15:24:12] (03PS1) 10Hnowlan: service: image-suggestion state 
to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/799357 (https://phabricator.wikimedia.org/T304891) [15:26:56] (03PS1) 10Hnowlan: service: image-suggestion state to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/799358 (https://phabricator.wikimedia.org/T304891) [15:27:23] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/799348 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [15:27:49] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35548/console" [puppet] - 10https://gerrit.wikimedia.org/r/799356 (owner: 10Giuseppe Lavagetto) [15:29:20] (03CR) 10Jbond: "see comment" [puppet] - 10https://gerrit.wikimedia.org/r/799354 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [15:30:30] (03CR) 10Muehlenhoff: idp::memcached: Only enable memcached_16 on Buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799354 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [15:30:54] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] deployment_server: actually pass the deployment_group to scap [puppet] - 10https://gerrit.wikimedia.org/r/799356 (owner: 10Giuseppe Lavagetto) [15:31:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P28549 and previous config saved to /var/cache/conftool/dbconfig/20220525-153117-ladsgroup.json [15:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:19] <_joe_> !log deploy2002:/srv/mediawiki-staging $ find . -group wikidev -print0 | sudo xargs -0 -n 100 chgrp -h deployment -- [15:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:49] <_joe_> !log deploy2002:/srv/patches $ find . 
-group wikidev -print0 | sudo xargs -0 -n 100 chgrp -h deployment -- [15:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:54] <_joe_> !log deploy1002:/srv/patches $ find . -group wikidev -print0 | sudo xargs -0 -n 100 chgrp -h deployment -- [15:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:01] 10SRE, 10Gerrit: Icinga Check SSL might have a time based race condition - https://phabricator.wikimedia.org/T308908 (10hashar) From `apache-status`: ` Current Time: Wednesday, 25-May-2022 15:16:44 UTC Restart Time: Friday, 22-Apr-2022 19:59:53 UTC ` ` +-----------------------------------------------... [15:38:17] <_joe_> deploy1002:/srv/mediawiki-staging $ find . -group wikidev -print0 | sudo xargs -0 -n 100 chgrp -h deployment -- [15:38:23] <_joe_> err [15:38:28] <_joe_> !log deploy1002:/srv/mediawiki-staging $ find . -group wikidev -print0 | sudo xargs -0 -n 100 chgrp -h deployment -- [15:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:06] 10SRE, 10Traffic, 10observability, 10Upstream: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (10hashar) gerrit1001.wikimedia.org is showing up the issue: >>! In T308908#7957451, @hashar wrote: > From `apache-status`: > ` > Current Time: Wedn... 
[15:46:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298555)', diff saved to https://phabricator.wikimedia.org/P28550 and previous config saved to /var/cache/conftool/dbconfig/20220525-154622-ladsgroup.json [15:46:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1135.eqiad.wmnet with reason: Maintenance [15:46:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1135.eqiad.wmnet with reason: Maintenance [15:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:29] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [15:46:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T298555)', diff saved to https://phabricator.wikimedia.org/P28551 and previous config saved to /var/cache/conftool/dbconfig/20220525-154630-ladsgroup.json [15:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:22] 10SRE, 10Gerrit: Icinga Check SSL might have a time based race condition - https://phabricator.wikimedia.org/T308908 (10dancy) I restarted apache2 on gerrit1001. ` +-----------------------------------------------------------------------+ | | | |Connections |Threads |Async connections... 
[15:50:19] (03PS1) 10Giuseppe Lavagetto: scap::master: remove the unused scap::l10nupdate class [puppet] - 10https://gerrit.wikimedia.org/r/799362 [15:58:02] (03CR) 10Vgutierrez: [C: 03+1] [WIP] esitest service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799) (owner: 10BBlack) [16:00:16] 10SRE, 10ops-eqiad, 10DC-Ops: Replace RAID controller battery in an-worker1081 - https://phabricator.wikimedia.org/T308434 (10BTullis) OK, 3pm est is a bit late for me, but I can shut down the two hosts with downtime before I leave for the day if that's OK with you. Once you've changed the battery you can po... [16:02:46] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:03:13] (KubernetesRsyslogDown) firing: (8) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:10:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Papaul) @nskaggs the blocker here is the partman recipe the hosts are using. There is an issue with the recipe. [16:13:08] 10SRE, 10ops-eqiad, 10DC-Ops: Replace RAID controller battery in an-worker1081 - https://phabricator.wikimedia.org/T308434 (10wiki_willy) Hi @BTullis - John typically gets into work a bit later in the day, but that should totally work. Thanks for checking! >>! In T308434#7957520, @BTullis wrote: > OK, 3pm... 
[16:20:27] (03CR) 10Jeena Huneidi: [C: 03+1] mwdebug service: Add traindev environment support [deployment-charts] - 10https://gerrit.wikimedia.org/r/798883 (owner: 10Ahmon Dancy) [16:26:22] (03CR) 10Nskaggs: Add dumps mapping to cache_upload (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793525 (https://phabricator.wikimedia.org/T306550) (owner: 10BBlack) [16:33:15] PROBLEM - Disk space on mx1001 is CRITICAL: DISK CRITICAL - /var/spool/exim4/db is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mx1001&var-datasource=eqiad+prometheus/ops [16:36:38] (03PS3) 10Zabe: acme_chief: remove absented acme-chief-designate-tidyup cron [puppet] - 10https://gerrit.wikimedia.org/r/779033 (https://phabricator.wikimedia.org/T273673) [16:36:45] (03CR) 10Zabe: acme_chief: remove absented acme-chief-designate-tidyup cron (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779033 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [16:41:55] (03CR) 10Jcrespo: [C: 03+2] bernard: Changes to dashboard, add individual sections data, minor fixes [software/bernard] - 10https://gerrit.wikimedia.org/r/714748 (https://phabricator.wikimedia.org/T289441) (owner: 10H.krishna123) [16:42:13] (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:42:37] (03CR) 10Jcrespo: [C: 03+2] bernard: Add simple documentation into README.md [software/bernard] - 10https://gerrit.wikimedia.org/r/714870 (https://phabricator.wikimedia.org/T289735) (owner: 10H.krishna123) [16:43:56] (03Merged) 10jenkins-bot: bernard: Changes to dashboard, add individual sections data, minor fixes [software/bernard] - 10https://gerrit.wikimedia.org/r/714748 (https://phabricator.wikimedia.org/T289441) (owner: 
10H.krishna123) [16:44:39] (03Merged) 10jenkins-bot: bernard: Add simple documentation into README.md [software/bernard] - 10https://gerrit.wikimedia.org/r/714870 (https://phabricator.wikimedia.org/T289735) (owner: 10H.krishna123) [16:47:52] (03PS1) 10Zabe: sudo: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799371 (https://phabricator.wikimedia.org/T308013) [16:49:45] 10SRE, 10SRE-OnFire, 10Observability-Logging, 10Sustainability (Incident Followup), 10Wikimedia-Incident: create a sampled log of POST data - https://phabricator.wikimedia.org/T309186 (10Dzahn) [16:50:43] (03PS1) 10Zabe: statsite: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799372 (https://phabricator.wikimedia.org/T308013) [16:52:43] (03PS1) 10Zabe: statograph: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799373 (https://phabricator.wikimedia.org/T308013) [16:55:44] bd808: I see you merged the developer-portal chart - are you going to deploy that to staging too? [16:57:01] (03PS1) 10Volans: transports: allow to set a global timeout [software/homer] - 10https://gerrit.wikimedia.org/r/799375 [16:57:03] (03PS1) 10Volans: devices: allow to pass additional metadata [software/homer] - 10https://gerrit.wikimedia.org/r/799376 [16:57:27] (03CR) 10Tchanders: Deploy SimilarEditors to the beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799012 (https://phabricator.wikimedia.org/T306908) (owner: 10AGueyte) [16:59:18] (03PS1) 10Zabe: squid: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799377 (https://phabricator.wikimedia.org/T308013) [17:02:28] (03PS2) 10Ebernhardson: cirrus: Migrate popularity_score configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775965 [17:05:41] (03CR) 10Cathal Mooney: [C: 03+1] "Looks great! nice work." [software/homer] - 10https://gerrit.wikimedia.org/r/799376 (owner: 10Volans) [17:05:58] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[software/homer] - 10https://gerrit.wikimedia.org/r/799375 (owner: 10Volans) [17:07:37] (03PS1) 10AGueyte: Assign similareditors right to the checkuser group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799378 (https://phabricator.wikimedia.org/T307205) [17:07:55] taavi: yes. I got delayed by meetings, but I will be pushing to staging very soon and then potentially to the eqiad and codfw clusters as well assuming that staging works out. [17:08:02] 10SRE, 10ops-drmrs, 10DC-Ops: balance power in eqsin - https://phabricator.wikimedia.org/T309231 (10RobH) p:05Triage→03Medium [17:08:40] 10SRE, 10ops-drmrs, 10DC-Ops: balance power in eqsin - https://phabricator.wikimedia.org/T309231 (10RobH) Original power figures: 22=7.1/6.3 23=5.5/5 [17:10:51] (03PS2) 10AGueyte: Deploy SimilarEditors to the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799012 (https://phabricator.wikimedia.org/T306908) [17:11:39] (03CR) 10CI reject: [V: 04-1] Deploy SimilarEditors to the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799012 (https://phabricator.wikimedia.org/T306908) (owner: 10AGueyte) [17:13:59] (03PS1) 10David Caro: Add readme, configure script and missing modules [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/799379 [17:14:06] (03PS3) 10AGueyte: Deploy SimilarEditors to the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799012 (https://phabricator.wikimedia.org/T306908) [17:15:07] (03Abandoned) 10Cwhite: opensearch: set USE_OPENSEARCH curator env variable [puppet] - 10https://gerrit.wikimedia.org/r/787824 (https://phabricator.wikimedia.org/T301017) (owner: 10Cwhite) [17:15:47] (03CR) 10AGueyte: Deploy SimilarEditors to the beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799012 (https://phabricator.wikimedia.org/T306908) (owner: 10AGueyte) [17:17:53] (03CR) 10CI reject: [V: 04-1] Add readme, configure script and missing modules [cookbooks] (wmcs) - 
10https://gerrit.wikimedia.org/r/799379 (owner: 10David Caro) [17:18:06] (03PS1) 10Volans: devices: override default timeout for mgmt routers [homer/public] - 10https://gerrit.wikimedia.org/r/799381 [17:18:22] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-worker1081.eqiad.wmnet with reason: T308434 [17:18:24] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-worker1081.eqiad.wmnet with reason: T308434 [17:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:29] T308434: Replace RAID controller battery in an-worker1081 - https://phabricator.wikimedia.org/T308434 [17:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:42] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on analytics1068.eqiad.wmnet with reason: T308434 [17:18:44] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on analytics1068.eqiad.wmnet with reason: T308434 [17:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:23] (03PS2) 10Volans: devices: override default timeout for mgmt routers [homer/public] - 10https://gerrit.wikimedia.org/r/799381 [17:24:18] 10SRE, 10ops-eqiad, 10DC-Ops: Replace RAID controller battery in an-worker1081 - https://phabricator.wikimedia.org/T308434 (10BTullis) I've shut down analytics1068 and an-worker1081 so they're ready for the RAID controller battery switch, whenever you are @Jclark-ctr :+1: They both have 24 hours of downtime... 
[17:25:30] 10SRE, 10ops-drmrs, 10DC-Ops: balance power in eqsin - https://phabricator.wikimedia.org/T309231 (10RobH) applied the command: racadm set System.Power.Hotspare.Enable 0 to all dell servers via idrac ssh and now power is much better balanced in eqsin: 5.5/5.6 & 5.3/4.8 [17:25:37] 10SRE, 10ops-drmrs, 10DC-Ops: balance power in eqsin - https://phabricator.wikimedia.org/T309231 (10RobH) 05Open→03Resolved [17:36:21] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:37:15] (03PS1) 10Jbond: nrpe: move plugins off the base nrpe class [puppet] - 10https://gerrit.wikimedia.org/r/799386 [17:38:45] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:18] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:40] (03CR) 10BCornwall: "Just a petty language change :)" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/793089 (owner: 10BCornwall) [17:54:17] (03CR) 10Dzahn: [C: 03+1] pws: simple grammar fix [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/793089 (owner: 10BCornwall) [17:56:45] (03CR) 10Klausman: [C: 03+1] ml-services: update docker image for revscoring-editquality-* [deployment-charts] - 10https://gerrit.wikimedia.org/r/799349 (https://phabricator.wikimedia.org/T309102) (owner: 10Elukey) [17:56:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298555)', diff saved to https://phabricator.wikimedia.org/P28553 and previous config saved to /var/cache/conftool/dbconfig/20220525-175648-ladsgroup.json [17:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:56] T298555: Fix mismatching field type 
of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [17:56:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host sretest1001.mgmt.eqiad.wmnet with reboot policy GRACEFUL [17:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:05] dancy and jnuche: How many deployers does it take to do Train log triage with CPT deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220525T1800). [18:00:05] dancy and jnuche: May I have your attention please! MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220525T1800) [18:00:29] PROBLEM - Host sretest1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:01:21] RECOVERY - Host sretest1001 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [18:01:28] (03CR) 10Dzahn: [C: 03+1] wikimedia.org: add gitlab-new records + PTR [dns] - 10https://gerrit.wikimedia.org/r/799334 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [18:01:53] thcipriani: I'm not going to make the triage meeting. [18:03:32] dancy: hrm, I wonder if I should just cancel considering I'm out tomorrow as well. [18:04:28] (03CR) 10BCornwall: [V: 03+2 C: 03+2] pws: simple grammar fix [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/793089 (owner: 10BCornwall) [18:04:44] !log joal@deploy1002 Started deploy [airflow-dags/analytics_test@3ae51e7]: (no justification provided) [18:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:51] !log joal@deploy1002 Finished deploy [airflow-dags/analytics_test@3ae51e7]: (no justification provided) (duration: 00m 07s) [18:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:42] thcipriani: I was confused by jouncebot. It thinks the meeting is right now. I can make it tomorrow [18:07:08] But it would be reasonable to cancel. I haven't seen very many interesting log messages so far. 
[18:07:52] (03CR) 10Dzahn: [C: 03+1] testreduce: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799306 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [18:08:05] !log joal@deploy1002 Started deploy [airflow-dags/analytics_test@3ae51e7]: (no justification provided) [18:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:14] !log joal@deploy1002 Finished deploy [airflow-dags/analytics_test@3ae51e7]: (no justification provided) (duration: 00m 08s) [18:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:32] (03PS1) 10QChris: Add .gitreview [software/pampinus] - 10https://gerrit.wikimedia.org/r/799406 [18:09:34] (03CR) 10QChris: [V: 03+2 C: 03+2] Add .gitreview [software/pampinus] - 10https://gerrit.wikimedia.org/r/799406 (owner: 10QChris) [18:11:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P28554 and previous config saved to /var/cache/conftool/dbconfig/20220525-181153-ladsgroup.json [18:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:46] (03CR) 10Subramanya Sastry: [C: 03+1] testreduce: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799306 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [18:13:38] (03CR) 10Dzahn: [C: 03+1] base: remove "managed by puppet" notice on /etc/skel/.bashrc [puppet] - 10https://gerrit.wikimedia.org/r/798874 (owner: 10BryanDavis) [18:14:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1001.mgmt.eqiad.wmnet with reboot policy GRACEFUL [18:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:25] (03CR) 10Jbond: [V: 03+2 C: 03+2] "here is an example of the command dissapearing on cp3050" [puppet] - 10https://gerrit.wikimedia.org/r/798823 (owner: 10Jbond) [18:18:47] Alright.. Train stuff. 
[18:19:05] Rolling wmf.13 forward to group1 [18:20:15] (03PS1) 10Ahmon Dancy: group1 wikis to 1.39.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799407 (https://phabricator.wikimedia.org/T305219) [18:20:16] (03CR) 10Ahmon Dancy: [C: 03+2] group1 wikis to 1.39.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799407 (https://phabricator.wikimedia.org/T305219) (owner: 10Ahmon Dancy) [18:20:34] dancy: btw, because you said "that sounds scary" yesterday (if that was meant for me). I figured out it's just when we try to rsync straight to an /etc/ dir on the remote side. Not happening when we use other pathes. And we are changing that now in the backup/restore script. [18:21:16] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799407 (https://phabricator.wikimedia.org/T305219) (owner: 10Ahmon Dancy) [18:22:34] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.13 refs T305219 [18:22:35] mutante: Thanks for the update! yeah, that was in response to what you said. I saw "read only filesystem" and immediately landed on filesystem corruption. :-) [18:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:40] T305219: 1.39.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T305219 [18:23:23] !log dancy@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.13 refs T305219 (duration: 00m 49s) [18:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:06] dancy: I actually started with "how do I force an fsck to make sure".. then got to "systemd decided you don't get that anymore the same way you used to" ..then "but you can still use tune2fs to tell it to do one if it's an ext filesystem".. then "not related to file system at all.. you can sync right into /foo in the root fs.. just not /etc" [18:24:27] What "read only filesystem" really in the error message? 
[18:24:30] *was? [18:24:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:25:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:56] yea, in the rsync output. but it's not actually read-only. it's some other protection like https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=30aba6656f61ed44cba445a3c0d38b296fa9e8f5 or so .. sysctl "protected" dirs [18:26:19] * dancy reads [18:26:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:36] !log joal@deploy1002 Started deploy [airflow-dags/analytics_test@3ae51e7]: (no justification provided) [18:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:46] !log joal@deploy1002 Finished deploy [airflow-dags/analytics_test@3ae51e7]: (no justification provided) (duration: 00m 09s) [18:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:51] well, I thought that's the one. "make data spoofing attacks harder" but it might not be. 
But definitely confirmed this is only about the /etc dir and not the fs itself [18:26:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P28555 and previous config saved to /var/cache/conftool/dbconfig/20220525-182658-ladsgroup.json [18:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:36] I think it's kind of nicer to not write into /etc/ in the first place.. so now I like the change regardless of the technical reason [18:28:00] Reasonable. [18:28:17] But now you've nerdsniped me on the technical reason. :-P [18:28:23] lol, i know [18:28:34] That link seems to be about "group writable sticky directories" which I would hope /etc is not. [18:28:41] I thought briefly "oooh.. THIS IS IT" after spending too much time on it [18:28:45] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [18:28:48] haha [18:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:53] but when I disabled that with sysctl ..it did not work either [18:29:01] nod [18:29:21] when I just used something like /gittest in the root of the root fs.. no problem there [18:29:25] also no sticky bit set on /etc [18:29:58] I'd like to see `df -h /etc` and `ld -ls /etc` [18:30:07] something is trying to protect /etc specifically. and it makes sense in the context of those attack vectors [18:30:41] Also, is there anything interesting about the target? This is a script that's booting a rescue environment to rebuild a system? [18:30:51] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frbackup2002 - https://phabricator.wikimedia.org/T306842 (10Jgreen) [18:31:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:34] it's not booting a rescue environment. 
it does backup and restore of gitlab-data and gitlab-config though [18:31:51] [gitlab1003:~] $ df -h /etc/ [18:31:51] Filesystem Size Used Avail Use% Mounted on [18:31:51] /dev/mapper/gitlab1003--vg-root 450G 26G 401G 7% / [18:31:55] so just plain rsync ops. [18:31:59] on a normally running system. [18:32:01] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:32:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:32:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:28] rsync running as root on the target? [18:32:58] rsyncd has "use_chroot" yes. but changing that to no makes no difference either. [18:33:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:33:11] also there are 2 separate rsync jobs (systemd timers) [18:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:19] one copies the data and one copies the config [18:33:29] the one that copies the data is identical except the path [18:33:32] and never had this issue [18:33:43] all other rsync settings regarding user/chroot etc are the same [18:34:24] yea, it's just rsync executed by the puppetized timers. then there are separate bash scripts doing restore from it [18:34:52] we can just do https://gerrit.wikimedia.org/r/c/operations/puppet/+/799280 next [18:35:44] (ld isn't installed by default) [18:37:21] oops. that was supposed to do be ls. [18:37:27] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:37:37] but you can skip that. 
It's clear you already investigated it. [18:37:40] deeply [18:37:51] 4.0K drwxr-xr-x 102 root root 4.0K May 24 20:08 . [18:42:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298555)', diff saved to https://phabricator.wikimedia.org/P28556 and previous config saved to /var/cache/conftool/dbconfig/20220525-184203-ladsgroup.json [18:42:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1140.eqiad.wmnet with reason: Maintenance [18:42:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1140.eqiad.wmnet with reason: Maintenance [18:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:12] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [18:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:30] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:09] (03CR) 10Herron: [C: 03+1] "neat!" 
[alerts] - 10https://gerrit.wikimedia.org/r/799285 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [18:46:04] (03CR) 10Herron: [C: 03+1] logstash: curator support new and legacy index patterns [puppet] - 10https://gerrit.wikimedia.org/r/798982 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [18:51:29] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [18:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:27] PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Connect - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:00:22] (03CR) 10Dzahn: [C: 03+1] gitlab: rsync config and data backup to same folder on replica [puppet] - 10https://gerrit.wikimedia.org/r/799280 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [19:00:38] (03Abandoned) 10Dzahn: gitlab: switch backup location to /srv, don't use /etc [puppet] - 10https://gerrit.wikimedia.org/r/799016 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [19:01:19] (03Restored) 10Dzahn: gitlab: switch backup location to /srv, don't use /etc [puppet] - 10https://gerrit.wikimedia.org/r/799016 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [19:01:46] (03CR) 10Dzahn: "wait.. so did you think this should be merged and then followed up with your change or should I just abandon this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/799016 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [19:02:30] PROBLEM - Host cloudgw1002 is DOWN: PING CRITICAL - Packet loss = 100% [19:02:38] PROBLEM - BGP status on cloudsw1-d5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Connect - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:03:08] RECOVERY - Host cloudgw1002 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [19:04:39] RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:05:07] RECOVERY - BGP status on cloudsw1-d5-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:18:23] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "nrpe: manage sudo rules via nrpe::check" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/798823 (owner: 10Jbond) [19:19:53] PROBLEM - Host analytics1068.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:22:52] 10SRE, 10Infrastructure-Foundations, 10Mail: [mitigated] Google returning 503 error when delivering to mx1001 and mx2001 - https://phabricator.wikimedia.org/T307873 (10jhathaway) [19:23:33] PROBLEM - Host an-worker1081.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:29:49] RECOVERY - Host an-worker1081.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.71 ms [19:31:10] (03CR) 10Krinkle: "I was just thinking the same thing! Sound great. 
We can also add a dropdown menu perhaps, populated by wgCanonicalServer or even just dbna" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [19:35:46] 10SRE, 10ops-eqiad, 10DC-Ops: Replace RAID controller battery in an-worker1081 - https://phabricator.wikimedia.org/T308434 (10Jclark-ctr) @BTullis completed moving raid battery and powered on [19:36:41] (03CR) 10Herron: [C: 03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/798886 (https://phabricator.wikimedia.org/T237224) (owner: 10Cwhite) [19:39:05] RECOVERY - Host analytics1068.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.18 ms [19:49:42] 10SRE, 10ops-eqiad, 10DC-Ops: Replace RAID controller battery in an-worker1081 - https://phabricator.wikimedia.org/T308434 (10Jclark-ctr) 05Open→03Resolved [19:52:07] (03CR) 10Herron: [C: 03+1] "LGTM!" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/790672 (https://phabricator.wikimedia.org/T307898) (owner: 10Vgutierrez) [19:59:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [19:59:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [19:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:05] RoanKattouw, Urbanecm, and cjming: Time to snap out of that daydream and deploy UTC late backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220525T2000). [20:00:05] ebernhardson and koi: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:20] Hello [20:00:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [20:00:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [20:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:59] Both of my patches have no need to test, please do a direct sync [20:02:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [20:02:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [20:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:47] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:03:13] (KubernetesRsyslogDown) firing: (8) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:04:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1173.eqiad.wmnet with reason: Maintenance [20:04:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1173.eqiad.wmnet with reason: Maintenance [20:04:19] hi - i can deploy [20:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:44] koi: i'll do 
your patches first since Erik will be on later [20:04:53] ack [20:05:43] ? [20:07:09] nothing :) Please go ahead [20:07:40] (03CR) 10Clare Ming: [C: 03+2] zhwikivoyage: Generate zh-hant logo variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793125 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:08:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [20:08:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [20:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:22] (03Merged) 10jenkins-bot: zhwikivoyage: Generate zh-hant logo variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793125 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:09:22] (03PS4) 10Clare Ming: zhwikivoyage: Declare commons files for logo and its variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793027 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:09:58] !log cjming@deploy1002 Synchronized static/images/project-logos: Config: [[gerrit:793125|zhwikivoyage: Generate zh-hant logo variant (T308620)]] (duration: 00m 50s) [20:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:04] T308620: HIDPI support for logos among Chinese projects - https://phabricator.wikimedia.org/T308620 [20:10:30] (03CR) 10Clare Ming: [C: 03+2] zhwikivoyage: Declare commons files for logo and its variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793027 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:11:14] (03Merged) 10jenkins-bot: zhwikivoyage: Declare commons files for logo and its variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793027 
(https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:13:42] !log cjming@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:793027|zhwikivoyage: Declare commons files for logo and its variant (T308620)]] (duration: 01m 25s) [20:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:44] !log cjming@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:793027|zhwikivoyage: Declare commons files for logo and its variant (T308620)]] (duration: 00m 49s) [20:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:55] koi: both your patches should be live [20:15:05] thx! [20:15:10] np! [20:15:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:15:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:35] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [20:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:48] hi ebernhardson: feel free to ping when you're here and I'd be happy to deploy your patch (unless you can/want to self-serve) [20:16:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:41] (03PS3) 10Clare Ming: cirrus: Migrate popularity_score configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775965 (owner: 10Ebernhardson) [20:21:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START 
helmfile.d/services/mwdebug: apply [20:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:22:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:08] 10SRE, 10User-AKlapper: Domain Ownership Verification on Various Search Properties - https://phabricator.wikimedia.org/T302617 (10Aklapper) [20:23:15] cjming: \o [20:23:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:38] hi ebernhardson: would you like me to deploy your patch or would you prefer to do it yourself (idk anyone's preferences)? [20:26:09] cjming: sure, you can deploy [20:26:14] alrighty [20:26:21] (03CR) 10Clare Ming: [C: 03+2] cirrus: Migrate popularity_score configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775965 (owner: 10Ebernhardson) [20:26:55] cjming: there isn't really anything to test, that config is only invoked when we create new indexes [20:26:57] RECOVERY - Disk space on mx1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mx1001&var-datasource=eqiad+prometheus/ops [20:27:03] (03Merged) 10jenkins-bot: cirrus: Migrate popularity_score configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775965 (owner: 10Ebernhardson) [20:27:13] ebernhardson: got it - i'll go ahead and sync then [20:28:49] !log cjming@deploy1002 Synchronized wmf-config/CirrusSearch-common.php: Config: [[gerrit:775965|cirrus: Migrate popularity_score configuration]] (duration: 00m 51s) [20:28:54] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:02] ebernhardson: should be live [20:29:11] cjming: thanks! [20:29:15] np! [20:30:43] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Incident: 2022-05-09 Exim BDAT Errors incident - https://phabricator.wikimedia.org/T309238 (10Peachey88) [20:31:08] i'm going to be bold and close this window early since yesterday's UTC late backport window went quite long [20:32:05] !log end of UTC late backport window [20:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:34:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1106.eqiad.wmnet with reason: Maintenance [20:36:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1106.eqiad.wmnet with reason: Maintenance [20:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [20:37:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on 
clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [20:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T298555)', diff saved to https://phabricator.wikimedia.org/P28558 and previous config saved to /var/cache/conftool/dbconfig/20220525-203708-ladsgroup.json [20:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:18] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [20:42:13] (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:58:11] (03CR) 10Joal: Disable cleanup on unused Fairscheduler for Hadoop. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799257 (owner: 10Slyngshede) [20:58:37] RECOVERY - IPMI Sensor Status on aqs1014 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [20:58:54] (03PS1) 10Stang: zhwiki: wmgSiteLogoVariants language fallback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T308620) [21:00:01] (03CR) 10CI reject: [V: 04-1] zhwiki: wmgSiteLogoVariants language fallback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [21:01:05] (03PS2) 10Stang: zhwiki: wmgSiteLogoVariants language fallback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T308620) [21:06:51] !log joal@deploy1002 Started deploy [airflow-dags/analytics_test@3ae51e7]: (no justification provided) [21:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:58] !log joal@deploy1002 Finished deploy [airflow-dags/analytics_test@3ae51e7]: (no justification provided) (duration: 00m 06s) [21:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:54] (03PS3) 10Stang: zhwiki: wmgSiteLogoVariants language fallback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T308620) [21:19:46] 10SRE, 10Data-Engineering, 10Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (10mpopov) Thank you @Milimetric for the ping! I missed this earlier in the month. > I did find one dataset where `wprov` is used, by #product-analytics, so perhaps @mpopov, w... 
[21:24:21] (03PS1) 10BryanDavis: developer-portal: Bump container version to 2022-05-25-165053-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/799416 [21:32:01] (03PS4) 10Stang: Add language fallback support for wmgSiteLogoVariants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T305692) [21:33:37] (03PS1) 10Gergő Tisza: Tombstone the old session on SessionBackend::resetId() [core] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/799388 (https://phabricator.wikimedia.org/T299193) [21:37:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10Jclark-ctr) [21:45:15] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [21:45:17] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [21:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:56] (03CR) 10Eevans: Add new Cassandra cluster for ML cache/feature-store workloads in eqiad (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232) (owner: 10Elukey) [21:46:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [21:47:46] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:00] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [22:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:49] PROBLEM - IPMI Sensor Status on ganeti1023 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [22:06:14] 10SRE, 10MediaWiki-Uploading, 10Structured Data Engineering, 10Structured-Data-Backlog, and 4 others: LocalFile::prerenderThumbnail should have a page limit - https://phabricator.wikimedia.org/T309114 (10Krinkle) This issue was introduced in MW 1.37 as part of [change 698367](https://gerrit.wikimedia.org/r... [22:06:40] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:48] 10SRE, 10MediaWiki-Uploading, 10MW-1.39-notes (1.39.0-wmf.14; 2022-05-30), 10Patch-For-Review, and 2 others: LocalFile::prerenderThumbnail should have a page limit - https://phabricator.wikimedia.org/T309114 (10Krinkle) a:03Jdforrester-WMF [22:13:48] (03PS1) 10Cathal Mooney: Add static routes to those exported by cloudsw to CR routers [homer/public] - 10https://gerrit.wikimedia.org/r/799419 (https://phabricator.wikimedia.org/T304936) [22:15:20] (03CR) 10Cathal Mooney: [C: 03+2] Add static routes to those exported by cloudsw to CR routers [homer/public] - 10https://gerrit.wikimedia.org/r/799419 (https://phabricator.wikimedia.org/T304936) (owner: 10Cathal Mooney) [22:15:43] (03CR) 10Krinkle: [C: 03+1] Revert "Cache Badtitle 400s for 60s in varnish-fe" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769827 (owner: 10Legoktm) [22:15:56] (03Merged) 10jenkins-bot: Add static routes to those exported by cloudsw to CR routers [homer/public] - 10https://gerrit.wikimedia.org/r/799419 (https://phabricator.wikimedia.org/T304936) (owner: 10Cathal Mooney) [22:19:52] 10SRE, 10SRE-OnFire (FY2021/2022-Q4), 10Observability-Alerting: Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10lmata) StatusPage is now officially launched and in service. 
While relevant and still needed, the open items here are not on th... [22:20:04] 10SRE, 10SRE-OnFire (FY2021/2022-Q4), 10Observability-Alerting: Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10lmata) [22:20:11] (03CR) 10Ori: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/799420 (owner: 10Ori) [22:22:13] (03PS1) 10Cathal Mooney: Remove reference to xe-3/0/4.1118 from CR routers Eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/799422 (https://phabricator.wikimedia.org/T304936) [22:23:31] (03CR) 10Cathal Mooney: [C: 03+2] Remove reference to xe-3/0/4.1118 from CR routers Eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/799422 (https://phabricator.wikimedia.org/T304936) (owner: 10Cathal Mooney) [22:24:20] (03Merged) 10jenkins-bot: Remove reference to xe-3/0/4.1118 from CR routers Eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/799422 (https://phabricator.wikimedia.org/T304936) (owner: 10Cathal Mooney) [22:27:59] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:36:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10cmooney) a:05cmooney→03Cmjohnson @nskaggs I believe that to be the case yes. I've not been able to successfully reimage any... [22:37:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10cmooney) a:05cmooney→03Cmjohnson @nskaggs I believe that to be the case yes. I've not been able to successfully reimage any of these... 
[22:37:55] PROBLEM - ensure kvm processes are running on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:39:23] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:42:27] RECOVERY - ensure kvm processes are running on cloudvirt1047 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:43:40] (03PS6) 10BCornwall: cli: Add support for XDG Base Directory spec [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459 [22:45:21] (03CR) 10CI reject: [V: 04-1] cli: Add support for XDG Base Directory spec [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459 (owner: 10BCornwall) [22:45:55] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [22:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:18] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [22:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:14] (03PS7) 10BCornwall: cli: Add support for XDG Base Directory spec [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459 [22:47:27] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [22:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:52] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [22:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298555)', diff saved to https://phabricator.wikimedia.org/P28559 and previous 
config saved to /var/cache/conftool/dbconfig/20220525-224957-ladsgroup.json [22:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:03] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [22:54:55] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:55:15] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:03:30] (03CR) 10BCornwall: "Sorry for the delay! How about this? I added the paths in a list so that future OS definitions can be straightforward." [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459 (owner: 10BCornwall) [23:04:59] (03PS1) 10BryanDavis: developer-portal: add service discovery records [dns] - 10https://gerrit.wikimedia.org/r/799427 (https://phabricator.wikimedia.org/T297140) [23:05:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P28560 and previous config saved to /var/cache/conftool/dbconfig/20220525-230502-ladsgroup.json [23:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:29] (03CR) 10BryanDavis: "Following the instructions at https://wikitech.wikimedia.org/wiki/Kubernetes/Ingress#DNS_changes" [dns] - 10https://gerrit.wikimedia.org/r/799427 (https://phabricator.wikimedia.org/T297140) (owner: 10BryanDavis) [23:06:40] (03CR) 10CI reject: [V: 04-1] developer-portal: add service discovery records [dns] - 10https://gerrit.wikimedia.org/r/799427 (https://phabricator.wikimedia.org/T297140) (owner: 10BryanDavis) [23:10:19] (03CR) 10BryanDavis: "Jerkins failure seems to be about an unrelated linter error:" [dns] - 10https://gerrit.wikimedia.org/r/799427 
(https://phabricator.wikimedia.org/T297140) (owner: 10BryanDavis) [23:20:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P28561 and previous config saved to /var/cache/conftool/dbconfig/20220525-232007-ladsgroup.json [23:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:02] (03PS1) 10BryanDavis: developer-portal: add to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/799429 (https://phabricator.wikimedia.org/T297140) [23:22:08] (03CR) 10CI reject: [V: 04-1] developer-portal: add to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/799429 (https://phabricator.wikimedia.org/T297140) (owner: 10BryanDavis) [23:27:35] (03PS2) 10BryanDavis: developer-portal: add to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/799429 (https://phabricator.wikimedia.org/T297140) [23:32:48] (03CR) 10BryanDavis: developer-portal: add to service catalog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799429 (https://phabricator.wikimedia.org/T297140) (owner: 10BryanDavis) [23:35:03] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [23:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:07] (03CR) 10Cathal Mooney: [C: 03+2] Change order that Netbox server provision script gets old/new vlan name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/799011 (https://phabricator.wikimedia.org/T304936) (owner: 10Cathal Mooney) [23:35:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298555)', diff saved to https://phabricator.wikimedia.org/P28562 and previous config saved to /var/cache/conftool/dbconfig/20220525-233512-ladsgroup.json [23:35:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1184.eqiad.wmnet with reason: Maintenance [23:35:15] !log ladsgroup@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1184.eqiad.wmnet with reason: Maintenance [23:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:18] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [23:35:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T298555)', diff saved to https://phabricator.wikimedia.org/P28563 and previous config saved to /var/cache/conftool/dbconfig/20220525-233520-ladsgroup.json [23:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:41] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [23:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:09] (03Merged) 10jenkins-bot: Change order that Netbox server provision script gets old/new vlan name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/799011 (https://phabricator.wikimedia.org/T304936) (owner: 10Cathal Mooney) [23:40:35] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:46:57] 10SRE, 10SRE-OnFire, 10Release-Engineering-Team, 10Sustainability: Remove old scap repositories from deploy1002 - https://phabricator.wikimedia.org/T309162 (10Dzahn) The number of repositories on deploy1002 and deploy2002 under /srv/deployment is the same (as intended by the rsync, it uses --delete). The t... 
[23:51:26] 10SRE, 10SRE-OnFire, 10Release-Engineering-Team, 10Sustainability: Remove old scap repositories from deploy1002 - https://phabricator.wikimedia.org/T309162 (10Dzahn) Top ten oldest repos by modifiation time, oldest first: ` Oct 9 2013 elasticsearch Dec 9 2013 scholarships Apr 18 2014 ocg May 30 201... [23:54:56] 10SRE, 10DBA, 10Performance-Team (Radar), 10Sustainability (MediaWiki-MultiDC): App servers <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809 (10tstarling) [23:56:34] 10SRE, 10SRE-OnFire, 10Release-Engineering-Team, 10Sustainability: Remove old scap repositories from deploy1002 - https://phabricator.wikimedia.org/T309162 (10Dzahn) list of repositories mentioned in `hieradata/role/common/deployment_server/kubernetes.yaml` (same have a repository but also a "scap_reposit... [23:58:44] 10SRE, 10Deployments, 10Parsoid, 10bacula, 10Release-Engineering-Team (Doing): Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 (10Dzahn) @hashar also see T309162#7958786