[00:02:11] PROBLEM - SSH on ml-serve-ctrl1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:03:32] SRE, SRE-Access-Requests: Requesting access to releaser for MarkAHershberger - https://phabricator.wikimedia.org/T302287 (Dzahn) Stalled→In progress
[00:04:13] RECOVERY - SSH on ml-serve-ctrl1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:05:29] PROBLEM - Check systemd state on ml-serve-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:09:28] (PS1) Dzahn: admin: reactivate account for Mark Hershberger, add to Mediawiki releasers [puppet] - https://gerrit.wikimedia.org/r/773660 (https://phabricator.wikimedia.org/T302287)
[00:10:33] (CR) jerkins-bot: [V: -1] admin: reactivate account for Mark Hershberger, add to Mediawiki releasers [puppet] - https://gerrit.wikimedia.org/r/773660 (https://phabricator.wikimedia.org/T302287) (owner: Dzahn)
[00:12:02] (CR) MarkAHershberger: [C: +1] admin: reactivate account for Mark Hershberger, add to Mediawiki releasers [puppet] - https://gerrit.wikimedia.org/r/773660 (https://phabricator.wikimedia.org/T302287) (owner: Dzahn)
[00:12:55] (PS2) Dzahn: admin: reactivate account for Mark Hershberger, add to Mediawiki releasers [puppet] - https://gerrit.wikimedia.org/r/773660 (https://phabricator.wikimedia.org/T302287)
[00:13:36] (CR) jerkins-bot: [V: -1] admin: reactivate account for Mark Hershberger, add to Mediawiki releasers [puppet] - https://gerrit.wikimedia.org/r/773660 (https://phabricator.wikimedia.org/T302287) (owner: Dzahn)
[00:14:03] (CR) Dzahn: "sorry, PS1 failed CI because of trailing whitespace. @MarkAHershberger now:)" [puppet] - https://gerrit.wikimedia.org/r/773660 (https://phabricator.wikimedia.org/T302287) (owner: Dzahn)
[00:15:28] (PS3) Dzahn: admin: reactivate account for Mark Hershberger, add to Mediawiki releasers [puppet] - https://gerrit.wikimedia.org/r/773660 (https://phabricator.wikimedia.org/T302287)
[00:16:36] (CR) Dzahn: "@MarkAHershberger I did not like your key because it was missing the prefix like "ssh-ed25519" or "ssh-rsa". I am guessing it is "ssh-ed25" [puppet] - https://gerrit.wikimedia.org/r/773660 (https://phabricator.wikimedia.org/T302287) (owner: Dzahn)
[00:16:50] (CR) Dzahn: "s/I/it (CI did not like the key)" [puppet] - https://gerrit.wikimedia.org/r/773660 (https://phabricator.wikimedia.org/T302287) (owner: Dzahn)
[00:30:33] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:34:55] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:36:59] (KubernetesCalicoDown) firing: ml-serve-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[00:38:53] SRE, ops-codfw, DC-Ops, RESTBase: Q3:(Need By: TBD) rack/setup/install restbase2027 - https://phabricator.wikimedia.org/T301399 (Papaul) getting the message below during install ` reuse-parts: Recipe device matching failed │ │ ERROR: =dev=md0 matches zero devices │...
[00:38:58] (KubernetesRsyslogDown) firing: rsyslog on ml-serve1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:39:10] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2027.codfw.wmnet with OS buster
[00:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:15] SRE, ops-codfw, DC-Ops, RESTBase: Q3:(Need By: TBD) rack/setup/install restbase2027 - https://phabricator.wikimedia.org/T301399 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host restbase2027.codfw.wmnet with OS buster executed with errors: - restbase20...
[00:45:07] (CR) Razzi: [C: +2] karapace: remove Type=notify [puppet] - https://gerrit.wikimedia.org/r/773387 (https://phabricator.wikimedia.org/T301565) (owner: Razzi)
[00:46:26] PROBLEM - SSH on ml-serve-ctrl1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:59:19] (PS1) Andrew Bogott: Make cloudvirt1047 a virt node [puppet] - https://gerrit.wikimedia.org/r/773667 (https://phabricator.wikimedia.org/T293391)
[01:01:18] (PS2) Andrew Bogott: Make cloudvirt1047 a virt node [puppet] - https://gerrit.wikimedia.org/r/773667 (https://phabricator.wikimedia.org/T293391)
[01:02:16] SRE, MassMessage, WMF-JobQueue, Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (Xeno_WMF) Hello, I believe I have encountered this bug here: https://meta.wikimedia.org/w/index.php?title=Special:Log&logid=471164...
[01:02:18] (CR) Andrew Bogott: [C: +2] Make cloudvirt1047 a virt node [puppet] - https://gerrit.wikimedia.org/r/773667 (https://phabricator.wikimedia.org/T293391) (owner: Andrew Bogott)
[01:03:58] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[01:08:58] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[01:10:43] PROBLEM - Check systemd state on ml-serve-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:11:39] RECOVERY - SSH on ml-serve-ctrl1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[01:11:58] (KubernetesCalicoDown) resolved: ml-serve-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[01:13:58] (KubernetesRsyslogDown) resolved: (2) rsyslog on ml-serve1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[01:24:10] (PS1) Razzi: sre.wikireplicas.update-views: add more options [cookbooks] - https://gerrit.wikimedia.org/r/773670 (https://phabricator.wikimedia.org/T297026)
[01:24:17] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:27:38] (CR) jerkins-bot: [V: -1] sre.wikireplicas.update-views: add more options [cookbooks] - https://gerrit.wikimedia.org/r/773670 (https://phabricator.wikimedia.org/T297026) (owner: Razzi)
[01:30:47] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:35:49] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:38:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:38:57] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (Andrew) I needed to enable virtualization in the bios but now this host is in service and seems fine. thanks @papaul!
[01:43:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:55] (PS2) Razzi: sre.wikireplicas.update-views: add more options [cookbooks] - https://gerrit.wikimedia.org/r/773670 (https://phabricator.wikimedia.org/T297026)
[01:50:09] (CR) jerkins-bot: [V: -1] sre.wikireplicas.update-views: add more options [cookbooks] - https://gerrit.wikimedia.org/r/773670 (https://phabricator.wikimedia.org/T297026) (owner: Razzi)
[02:54:48] SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (Andrew) Open→Resolved
[02:54:53] (CR) NguoiDungKhongDinhDanh: "Since you claimed T303579, would you mind review this first patch of mine? Thanks a lot." [mediawiki-config] - https://gerrit.wikimedia.org/r/773320 (https://phabricator.wikimedia.org/T303579) (owner: NguoiDungKhongDinhDanh)
[02:59:33] (CR) AntiCompositeNumber: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/773320 (https://phabricator.wikimedia.org/T303579) (owner: NguoiDungKhongDinhDanh)
[03:01:18] (CR) AntiCompositeNumber: "Please write an informative commit message that explains what is being changed and why. https://www.mediawiki.org/wiki/Gerrit/Commit_messa" [mediawiki-config] - https://gerrit.wikimedia.org/r/773320 (https://phabricator.wikimedia.org/T303579) (owner: NguoiDungKhongDinhDanh)
[03:28:57] PROBLEM - SSH on ml-serve-ctrl1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:29:58] (KubernetesCalicoDown) firing: ml-serve-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[03:33:17] RECOVERY - SSH on ml-serve-ctrl1001 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:33:45] (JobUnavailable) firing: (2) Reduced availability for job k8s-api in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:34:58] (KubernetesCalicoDown) resolved: ml-serve-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[03:38:13] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:57:37] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[04:22:43] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.06 ms
[05:30:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1134 for testing', diff saved to https://phabricator.wikimedia.org/P23053 and previous config saved to /var/cache/conftool/dbconfig/20220325-053037-marostegui.json
[05:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:30:56] (PS1) Marostegui: Revert "db2087: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/773693
[05:36:42] (CR) Marostegui: [C: +2] Revert "db2087: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/773693 (owner: Marostegui)
[05:38:32] (PS1) Marostegui: db1134: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/773676 (https://phabricator.wikimedia.org/T304626)
[05:40:08] (CR) Marostegui: [C: +2] db1134: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/773676 (https://phabricator.wikimedia.org/T304626) (owner: Marostegui)
[05:40:33] (CR) Marostegui: "That sounds good, I will merge this now though so we have it done here as well." [software] - https://gerrit.wikimedia.org/r/773440 (https://phabricator.wikimedia.org/T303605) (owner: Marostegui)
[05:40:35] (CR) Marostegui: [C: +2] switchover-tmpl.sh: Add "Affected wikis" field [software] - https://gerrit.wikimedia.org/r/773440 (https://phabricator.wikimedia.org/T303605) (owner: Marostegui)
[05:46:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[05:47:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[05:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T302658)', diff saved to https://phabricator.wikimedia.org/P23054 and previous config saved to /var/cache/conftool/dbconfig/20220325-054705-marostegui.json
[05:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:11] T302658: globaluser table schema changes (March 2022) - https://phabricator.wikimedia.org/T302658
[05:52:01] (CR) Marostegui: Add fix_user_varbinaries_T298565.py (1 comment) [software/schema-changes] - https://gerrit.wikimedia.org/r/773655 (https://phabricator.wikimedia.org/T298565) (owner: Ladsgroup)
[05:55:21] (CR) Marostegui: [C: +2] filtered_tables.txt: remove gu_enabled and gu_enabled_method columns [puppet] - https://gerrit.wikimedia.org/r/773616 (https://phabricator.wikimedia.org/T303266) (owner: Zabe)
[06:07:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1112 for schema change', diff saved to https://phabricator.wikimedia.org/P23055 and previous config saved to /var/cache/conftool/dbconfig/20220325-060723-marostegui.json
[06:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:30] PROBLEM - LVS zotero eqiad port 4969/tcp - Zotero- zotero.svc.eqiad.wmnet IPv4 #page on zotero.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[06:11:32] RECOVERY - LVS zotero eqiad port 4969/tcp - Zotero- zotero.svc.eqiad.wmnet IPv4 #page on zotero.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 196 bytes in 1.012 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[06:18:06] PROBLEM - LVS zotero eqiad port 4969/tcp - Zotero- zotero.svc.eqiad.wmnet IPv4 #page on zotero.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[06:18:49] On phone but here
[06:19:00] Will soon get to laptop
[06:25:56] <_joe_> taking a look now
[06:28:01] Thanks
[06:28:31] (PS2) Ladsgroup: Add fix_user_varbinaries_T298565.py [software/schema-changes] - https://gerrit.wikimedia.org/r/773655 (https://phabricator.wikimedia.org/T298565)
[06:28:46] (CR) Ladsgroup: Add fix_user_varbinaries_T298565.py (1 comment) [software/schema-changes] - https://gerrit.wikimedia.org/r/773655 (https://phabricator.wikimedia.org/T298565) (owner: Ladsgroup)
[06:29:20] (CR) Marostegui: [C: +1] Add fix_user_varbinaries_T298565.py [software/schema-changes] - https://gerrit.wikimedia.org/r/773655 (https://phabricator.wikimedia.org/T298565) (owner: Ladsgroup)
[06:29:46] (CR) Ladsgroup: [C: +2] Add fix_user_varbinaries_T298565.py [software/schema-changes] - https://gerrit.wikimedia.org/r/773655 (https://phabricator.wikimedia.org/T298565) (owner: Ladsgroup)
[06:29:51] !log dbmaint s4@eqiad T300775
[06:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:57] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[06:30:01] <_joe_> Amir1: if you want to take a look too, basically we had a huge cpu spike
[06:30:10] (Merged) jenkins-bot: Add fix_user_varbinaries_T298565.py [software/schema-changes] - https://gerrit.wikimedia.org/r/773655 (https://phabricator.wikimedia.org/T298565) (owner: Ladsgroup)
[06:31:24] <_joe_> !log deleting a couple zotero pods with excessive number of restarts
[06:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:44] <_joe_> not sure why the zotero page hasn't recovered
[06:33:37] it might take a while?
[06:33:56] <_joe_> no, it actually looks like it's just not working
[06:37:30] <_joe_> Amir1: basically what I did was
[06:37:43] <_joe_> look at pod details here https://grafana.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&var-datasource=eqiad%20prometheus%2Fk8s&var-namespace=zotero&var-pod=All
[06:37:55] <_joe_> as root
[06:37:59] <_joe_> kube_env admin eqiad
[06:38:14] <_joe_> kubectl -n zotero delete pod
[06:39:53] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:41:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[06:41:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[06:41:35] PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[06:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23056 and previous config saved to /var/cache/conftool/dbconfig/20220325-064139-ladsgroup.json
[06:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:45] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[06:41:52] RECOVERY - LVS zotero eqiad port 4969/tcp - Zotero- zotero.svc.eqiad.wmnet IPv4 #page on zotero.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 196 bytes in 1.016 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[06:42:00] <_joe_> took your time icinga-wm
[06:42:05] haha
[06:42:27] maybe it should kill them automatically? :D
[06:43:20] <_joe_> Amir1: let me ask you this - do you think kubernetes doesn't have such facilities? but zotero is resistant to any decent production practice
[06:43:33] <_joe_> (and even worse, newer versions of zotero don't have a server component)
[06:43:45] sigh
[06:44:01] <_joe_> basically you normally gather if a pod is able to serve traffic using a readiness probe
[06:44:09] <_joe_> which we can't have in zotero lol
[06:44:22] <_joe_> anyways
[06:44:38] <_joe_> ttyl
[06:44:40] Thanks for fixing this. I'm in a train with spotty connection
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220325T0700)
[07:10:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23057 and previous config saved to /var/cache/conftool/dbconfig/20220325-071054-ladsgroup.json
[07:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:00] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[07:17:00] SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (elukey) Resolved→Open Hi Chris! I noticed that we have two nodes on the same ROW, would it be possible to move one elsewhere? We are going to h...
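Editorial sketch, not part of the channel log: the manual cleanup _joe_ describes between 06:31 and 06:38 (find the zotero pods with an excessive number of restarts, then run `kubectl -n zotero delete pod` on each) could be approximated with a small helper like the one below. It only parses text and builds command strings; it never calls `kubectl` itself. The restart threshold, the function names and the sample pod names are assumptions made for illustration, and the input is assumed to be the plain output of `kubectl -n zotero get pods --no-headers` (columns NAME, READY, STATUS, RESTARTS, AGE).

```python
from typing import List


def pods_over_restart_limit(get_pods_output: str, max_restarts: int = 5) -> List[str]:
    """Return names of pods whose RESTARTS column exceeds max_restarts.

    get_pods_output is assumed to be the text printed by
    `kubectl -n <namespace> get pods --no-headers`, i.e. whitespace-separated
    columns NAME READY STATUS RESTARTS AGE. max_restarts is a hypothetical
    threshold; the log only says "excessive number of restarts".
    """
    names = []
    for line in get_pods_output.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        name, restarts = fields[0], fields[3]
        if restarts.isdigit() and int(restarts) > max_restarts:
            names.append(name)
    return names


def delete_commands(pod_names: List[str], namespace: str = "zotero") -> List[str]:
    """Build (but do not run) one `kubectl delete pod` command per pod."""
    return [f"kubectl -n {namespace} delete pod {name}" for name in pod_names]


# Made-up sample output; these pod names do not come from the log.
SAMPLE = """\
zotero-example-aaaa 1/1 Running 0 3d
zotero-example-bbbb 1/1 Running 12 3d
zotero-example-cccc 0/1 CrashLoopBackOff 31 3d
"""

if __name__ == "__main__":
    for command in delete_commands(pods_over_restart_limit(SAMPLE)):
        print(command)
```

Printing the commands instead of executing them keeps a human in the loop, which matches how the deletion was done by hand in the log after checking the Grafana pod-details dashboard.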
[07:18:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T302658)', diff saved to https://phabricator.wikimedia.org/P23058 and previous config saved to /var/cache/conftool/dbconfig/20220325-071840-marostegui.json
[07:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:46] T302658: globaluser table schema changes (March 2022) - https://phabricator.wikimedia.org/T302658
[07:26:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P23059 and previous config saved to /var/cache/conftool/dbconfig/20220325-072559-ladsgroup.json
[07:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:25] (CR) Filippo Giunchedi: [V: +2 C: +2] "Yes I believe the end goal is dynamic text indeed" [debs/grafana-plugins] - https://gerrit.wikimedia.org/r/773456 (https://phabricator.wikimedia.org/T304585) (owner: Phedenskog)
[07:30:18] PROBLEM - LVS zotero eqiad port 4969/tcp - Zotero- zotero.svc.eqiad.wmnet IPv4 #page on zotero.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[07:30:21] (CR) Filippo Giunchedi: "Thanks for the review Cole! Peter, mind submitting a patch for https://phabricator.wikimedia.org/T304587 too so we can bundle (hah!) the p" [debs/grafana-plugins] - https://gerrit.wikimedia.org/r/773456 (https://phabricator.wikimedia.org/T304585) (owner: Phedenskog)
[07:31:19] hah, reoccurrence of zotero throwing its toys out of the pram
[07:31:23] ?
[07:33:28] (PS22) Elukey: Refactor Calico's CNI plugin config [puppet] - https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612)
[07:33:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P23060 and previous config saved to /var/cache/conftool/dbconfig/20220325-073345-marostegui.json
[07:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:00] (JobUnavailable) firing: Reduced availability for job trafficserver in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:35:43] there are a couple of pods with throttled cpu
[07:36:42] RECOVERY - LVS zotero eqiad port 4969/tcp - Zotero- zotero.svc.eqiad.wmnet IPv4 #page on zotero.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 197 bytes in 1.016 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[07:40:57] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:41:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P23061 and previous config saved to /var/cache/conftool/dbconfig/20220325-074105-ladsgroup.json
[07:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P23062 and previous config saved to /var/cache/conftool/dbconfig/20220325-074850-marostegui.json
[07:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:35] (PS1) Filippo Giunchedi: hieradata: change puppetdb-api probe to check for 200 status code [puppet] - https://gerrit.wikimedia.org/r/773740
[07:56:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23063 and previous config saved to /var/cache/conftool/dbconfig/20220325-075610-ladsgroup.json
[07:56:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance
[07:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:15] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[07:56:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance
[07:56:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance
[07:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance
[07:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:05] RECOVERY - Host ms-be1071 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[07:57:34] (PS23) Elukey: Refactor Calico's CNI plugin config [puppet] - https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612)
[08:02:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[08:02:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [08:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T302658)', diff saved to https://phabricator.wikimedia.org/P23064 and previous config saved to /var/cache/conftool/dbconfig/20220325-080355-marostegui.json [08:03:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance [08:03:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance [08:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:00] T302658: globaluser table schema changes (March 2022) - https://phabricator.wikimedia.org/T302658 [08:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T302658)', diff saved to https://phabricator.wikimedia.org/P23065 and previous config saved to /var/cache/conftool/dbconfig/20220325-080403-marostegui.json [08:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:54] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM now." 
[puppet] - 10https://gerrit.wikimedia.org/r/724049 (https://phabricator.wikimedia.org/T205361) (owner: 10Majavah) [08:05:48] (03PS24) 10Elukey: Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) [08:05:51] <_joe_> taavi: sorry for the delay, I needed to check the apache docs about QSA [08:11:31] (03CR) 10Elukey: [C: 04-1] "WIP" [puppet] - 10https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [08:15:06] (03PS25) 10Elukey: Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) [08:15:38] (03CR) 10jerkins-bot: [V: 04-1] Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [08:16:20] (03CR) 10Muehlenhoff: [C: 03+2] Enable Ganeti 3 for ganeti-test* [puppet] - 10https://gerrit.wikimedia.org/r/773564 (owner: 10Muehlenhoff) [08:24:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [08:24:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [08:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23066 and previous config saved to /var/cache/conftool/dbconfig/20220325-082446-ladsgroup.json [08:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:53] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, 
user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [08:26:39] (03PS26) 10Elukey: Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) [08:30:24] 10SRE, 10serviceops: Clean up old Docker images on deneb - https://phabricator.wikimedia.org/T287222 (10JMeybohm) There now is a prune timer, see T304644 [08:40:30] (03CR) 10Elukey: [C: 04-1] "Had a chat with Joe, the cni define should represent only a config/kubeconfig, and not a list of them. I am going to rework this change to" [puppet] - 10https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [08:46:53] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (7) node(s) change every puppet run: cloudcontrol1003, cloudcontrol1004, cp1085, deploy1002, deploy2002, ms-be1068, ms-be1071 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [08:48:02] (03PS27) 10Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) [08:48:35] (03CR) 10jerkins-bot: [V: 04-1] WIP - Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [08:49:32] (03PS28) 10Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) [08:51:05] (03PS29) 10Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) [08:51:40] (03CR) 10Jcrespo: [C: 03+2] Add new command line utility to update existing metadata [software/mediabackups] - 10https://gerrit.wikimedia.org/r/773444 (https://phabricator.wikimedia.org/T299764) (owner: 10Jcrespo) [08:52:15] 
(CR) JMeybohm: [C: -1] kubernetes: clean up extra netboot and host settings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/773520 (https://phabricator.wikimedia.org/T300744) (owner: Elukey) [08:53:49] (CR) JMeybohm: [C: +1] Initial debianization of istio-cni (2 comments) [debs/istio] - https://gerrit.wikimedia.org/r/771670 (https://phabricator.wikimedia.org/T297612) (owner: Elukey) [08:53:51] (PS1) Majavah: wmcs: toolforge: k8s: show output of deploy.sh [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/773743 [08:54:46] (PS4) Elukey: kubernetes: clean up extra netboot and host settings [puppet] - https://gerrit.wikimedia.org/r/773520 (https://phabricator.wikimedia.org/T300744) [08:55:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T302658)', diff saved to https://phabricator.wikimedia.org/P23067 and previous config saved to /var/cache/conftool/dbconfig/20220325-085508-marostegui.json [08:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:14] T302658: globaluser table schema changes (March 2022) - https://phabricator.wikimedia.org/T302658 [08:55:20] (CR) Elukey: kubernetes: clean up extra netboot and host settings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/773520 (https://phabricator.wikimedia.org/T300744) (owner: Elukey) [08:56:42] (CR) JMeybohm: [C: +1] kubernetes: clean up extra netboot and host settings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/773520 (https://phabricator.wikimedia.org/T300744) (owner: Elukey) [08:58:27] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (7) node(s) change every puppet run: cloudcontrol1003, cloudcontrol1004, cp1085, deploy1002, deploy2002, ms-be1068, ms-be1071 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [08:58:54] (CR) jerkins-bot: [V: -1] wmcs: toolforge: k8s: 
show output of deploy.sh [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/773743 (owner: Majavah) [09:01:07] (PS30) Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) [09:02:30] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34559/console" [puppet] - https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) (owner: Elukey) [09:06:39] (PS5) Elukey: kubernetes: clean up extra netboot and host settings [puppet] - https://gerrit.wikimedia.org/r/773520 (https://phabricator.wikimedia.org/T300744) [09:10:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P23068 and previous config saved to /var/cache/conftool/dbconfig/20220325-091013-marostegui.json [09:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:57] (CR) David Caro: [C: +2] "The error seems unrelated (happens in one of the sre cookbooks), will rebase the branch see if it fixes it, but this can go in." 
[cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/773743 (owner: Majavah) [09:19:55] (PS1) David Caro: discovery: remove unneeded protected-access supression [cookbooks] - https://gerrit.wikimedia.org/r/773744 [09:20:05] (CR) Elukey: "recheck" [puppet] - https://gerrit.wikimedia.org/r/773520 (https://phabricator.wikimedia.org/T300744) (owner: Elukey) [09:23:26] (CR) David Caro: [C: +2] "See https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/773744 for the test fix, will rebase on top of master once that is merged" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/773743 (owner: Majavah) [09:25:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23069 and previous config saved to /var/cache/conftool/dbconfig/20220325-092500-ladsgroup.json [09:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:07] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [09:25:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P23070 and previous config saved to /var/cache/conftool/dbconfig/20220325-092518-marostegui.json [09:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:54] !log updating libapache2-mod-auth-cas on moscovium/debmonitor1002 [09:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:38] (CR) JMeybohm: [C: -1] Add helm charts and a helmfile configuration for datahub (5 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 
Btullis) [09:31:41] (PS1) Jelto: gitlab_runner: add option to drop Docker capabilities [puppet] - https://gerrit.wikimedia.org/r/773746 (https://phabricator.wikimedia.org/T295481) [09:32:12] I am not getting any V: +2 from jenkins, checked https://integration.wikimedia.org/zuul/ but didn't see strange things [09:32:37] hashar (if you are around) --^ o/ [09:33:43] (PS1) Filippo Giunchedi: sre: add ProbeDown paging alert for enabled services [alerts] - https://gerrit.wikimedia.org/r/773747 (https://phabricator.wikimedia.org/T291946) [09:40:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P23071 and previous config saved to /var/cache/conftool/dbconfig/20220325-094006-ladsgroup.json [09:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T302658)', diff saved to https://phabricator.wikimedia.org/P23072 and previous config saved to /var/cache/conftool/dbconfig/20220325-094023-marostegui.json [09:40:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1181.eqiad.wmnet with reason: Maintenance [09:40:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1181.eqiad.wmnet with reason: Maintenance [09:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:28] T302658: globaluser table schema changes (March 2022) - https://phabricator.wikimedia.org/T302658 [09:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T302658)', diff saved to https://phabricator.wikimedia.org/P23073 and previous config saved to /var/cache/conftool/dbconfig/20220325-094031-marostegui.json [09:40:34] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:40] (CR) Elukey: [C: +2] kubernetes: clean up extra netboot and host settings [puppet] - https://gerrit.wikimedia.org/r/773520 (https://phabricator.wikimedia.org/T300744) (owner: Elukey) [09:44:10] (CR) jerkins-bot: [V: -1] wmcs: toolforge: k8s: show output of deploy.sh [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/773743 (owner: Majavah) [09:46:48] (CR) David Caro: [C: +2] "Hahahah, of course it would not pass the gating tests xd, not sure what was I thinking" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/773743 (owner: Majavah) [09:47:19] (PS2) David Caro: wmcs: toolforge: k8s: show output of deploy.sh [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/773743 (owner: Majavah) [09:47:34] (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34560/console" [puppet] - https://gerrit.wikimedia.org/r/773746 (https://phabricator.wikimedia.org/T295481) (owner: Jelto) [09:48:22] (CR) Filippo Giunchedi: [C: +2] hieradata: change puppetdb-api probe to check for 200 status code [puppet] - https://gerrit.wikimedia.org/r/773740 (owner: Filippo Giunchedi) [09:50:19] (CR) jerkins-bot: [V: -1] wmcs: toolforge: k8s: show output of deploy.sh [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/773743 (owner: Majavah) [09:54:54] /12 [09:54:56] uff [09:55:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P23074 and previous config saved to /var/cache/conftool/dbconfig/20220325-095511-ladsgroup.json [09:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:39] (PS1) Majavah: toolforge: remove ingress-nginx manifests [puppet] - https://gerrit.wikimedia.org/r/773750 
[10:05:58] (PS2) Elukey: decommission kubernetes[12]00[1-4] [puppet] - https://gerrit.wikimedia.org/r/771850 (https://phabricator.wikimedia.org/T303044) (owner: Alexandros Kosiaris) [10:08:55] (PS1) Elukey: kubernetes: apply devicemapper settings to kubernetes[12]00[1-4] [puppet] - https://gerrit.wikimedia.org/r/773751 (https://phabricator.wikimedia.org/T300744) [10:10:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23075 and previous config saved to /var/cache/conftool/dbconfig/20220325-101016-ladsgroup.json [10:10:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [10:10:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [10:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:21] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [10:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:38] (PS2) Elukey: kubernetes: apply devicemapper settings to kubernetes[12]00[1-4] [puppet] - https://gerrit.wikimedia.org/r/773751 (https://phabricator.wikimedia.org/T300744) [10:11:06] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1005.eqiad.wmnet [10:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:33] (CR) Elukey: [C: +2] kubernetes: apply devicemapper settings to kubernetes[12]00[1-4] [puppet] - 
https://gerrit.wikimedia.org/r/773751 (https://phabricator.wikimedia.org/T300744) (owner: Elukey) [10:12:26] (CR) JMeybohm: [C: +1] kubernetes: apply devicemapper settings to kubernetes[12]00[1-4] [puppet] - https://gerrit.wikimedia.org/r/773751 (https://phabricator.wikimedia.org/T300744) (owner: Elukey) [10:17:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T302658)', diff saved to https://phabricator.wikimedia.org/P23076 and previous config saved to /var/cache/conftool/dbconfig/20220325-101701-marostegui.json [10:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:08] T302658: globaluser table schema changes (March 2022) - https://phabricator.wikimedia.org/T302658 [10:18:23] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1005.eqiad.wmnet [10:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:39] (CR) Arturo Borrero Gonzalez: "Given there could be some interim confusion, would you please leave links to the gitlab repo everywhere?" [puppet] - https://gerrit.wikimedia.org/r/773750 (owner: Majavah) [10:22:44] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1008.eqiad.wmnet [10:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:38] SRE, Data-Engineering, Data-Engineering-Kanban: Create conda .deb and docker image - https://phabricator.wikimedia.org/T304450 (MoritzMuehlenhoff) >>! In T304450#7797902, @Ottomata wrote: > @MoritzMuehlenhoff advice? Can I import [[ https://docs.conda.io/projects/conda/en/latest/user-guide/install/r... 
[10:28:55] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:32:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P23077 and previous config saved to /var/cache/conftool/dbconfig/20220325-103207-marostegui.json [10:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [10:33:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [10:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23078 and previous config saved to /var/cache/conftool/dbconfig/20220325-103310-ladsgroup.json [10:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:16] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, 
user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [10:33:27] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1008.eqiad.wmnet [10:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:43] (CR) Arturo Borrero Gonzalez: [C: +1] "This conflicts with https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/773509" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/773743 (owner: Majavah) [10:45:53] (PS39) Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) [10:46:14] (CR) Btullis: Add helm charts and a helmfile configuration for datahub (5 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis) [10:46:52] (CR) Giuseppe Lavagetto: Introduce requestctl (4 comments) [software/conftool] - https://gerrit.wikimedia.org/r/772342 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto) [10:47:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P23079 and previous config saved to /var/cache/conftool/dbconfig/20220325-104712-marostegui.json [10:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:08] (CR) Arturo Borrero Gonzalez: [C: +2] keepalived: use version from bullseye-bpo [puppet] - https://gerrit.wikimedia.org/r/773585 (https://phabricator.wikimedia.org/T304598) (owner: Arturo Borrero Gonzalez) [10:50:24] (CR) Arturo Borrero Gonzalez: [C: +2] cloudgw: don't install kernel or nft from backports [puppet] - https://gerrit.wikimedia.org/r/773586 (https://phabricator.wikimedia.org/T304598) (owner: Arturo Borrero Gonzalez) [10:53:05] PROBLEM - restbase 
endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:53:07] (CR) Jbond: [C: +1] "lgtm" [software/conftool] - https://gerrit.wikimedia.org/r/772342 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto) [10:53:48] (CR) Jbond: "thanks 😊" [puppet] - https://gerrit.wikimedia.org/r/773740 (owner: Filippo Giunchedi) [10:55:15] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:02:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T302658)', diff saved to https://phabricator.wikimedia.org/P23080 and previous config saved to /var/cache/conftool/dbconfig/20220325-110217-marostegui.json [11:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:22] T302658: globaluser table schema changes (March 2022) - https://phabricator.wikimedia.org/T302658 [11:05:59] (CR) Jbond: [C: -2] "this job is also responsible for downloading the following datasets which AFAIK are very much in use" [puppet] - https://gerrit.wikimedia.org/r/773648 (https://phabricator.wikimedia.org/T303464) (owner: Dzahn) [11:07:26] (CR) Elukey: [V: +2 C: +2] Initial debianization of istio-cni [debs/istio] - https://gerrit.wikimedia.org/r/771670 (https://phabricator.wikimedia.org/T297612) (owner: Elukey) [11:08:11] (CR) Jbond: [C: +1] "lgtm thx" [puppet] - https://gerrit.wikimedia.org/r/773660 (https://phabricator.wikimedia.org/T302287) (owner: Dzahn) [11:08:51] (PS1) DCausse: team-search-platform: add jvmquake alerting [alerts] - https://gerrit.wikimedia.org/r/773758 (https://phabricator.wikimedia.org/T293862) [11:12:53] RECOVERY - Check systemd state 
on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:21:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23081 and previous config saved to /var/cache/conftool/dbconfig/20220325-112145-ladsgroup.json [11:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:52] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [11:24:01] !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-workers restart workers for Hadoop analytics cluster: Roll restart of jvm daemons for openjdk upgrade. [11:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:17] (Abandoned) Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: support shorter bastion prefix [puppet] - https://gerrit.wikimedia.org/r/749744 (owner: Arturo Borrero Gonzalez) [11:28:03] (PS7) Giuseppe Lavagetto: Introduce requestctl [software/conftool] - https://gerrit.wikimedia.org/r/772342 (https://phabricator.wikimedia.org/T302471) [11:28:05] (PS1) Giuseppe Lavagetto: Add debian packaging for requestctl [software/conftool] - https://gerrit.wikimedia.org/r/773760 [11:29:54] (CR) jerkins-bot: [V: -1] Introduce requestctl [software/conftool] - https://gerrit.wikimedia.org/r/772342 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto) [11:30:03] (CR) jerkins-bot: [V: -1] Add debian packaging for requestctl [software/conftool] - https://gerrit.wikimedia.org/r/773760 (owner: Giuseppe Lavagetto) [11:30:35] PROBLEM - SSH on thumbor2004.mgmt is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:32:29] SRE, Infrastructure-Foundations: Integrate Bullseye 11.3 point update - https://phabricator.wikimedia.org/T304599 (jbond) p:Triage→Medium [11:32:33] (PS21) MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) [11:33:07] (Abandoned) Arturo Borrero Gonzalez: UNTESTED: openstack: neutron: refresh API policy to allow port management [puppet] - https://gerrit.wikimedia.org/r/606991 (https://phabricator.wikimedia.org/T255670) (owner: Arturo Borrero Gonzalez) [11:33:34] (Abandoned) Arturo Borrero Gonzalez: nftables: introduce nft-check exec [puppet] - https://gerrit.wikimedia.org/r/651453 (owner: Arturo Borrero Gonzalez) [11:34:00] (JobUnavailable) firing: Reduced availability for job trafficserver in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:35:47] SRE, Data-Engineering-Radar, Traffic: Lock-in Varnish and VarnishKafka versions - https://phabricator.wikimedia.org/T304617 (jbond) p:Triage→Medium [11:35:54] (PS11) MVernon: puppetmaster: rsync swift rings from each cluster's ring manager [puppet] - https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) [11:36:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P23082 and previous config saved to /var/cache/conftool/dbconfig/20220325-113651-ladsgroup.json [11:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:53] (PS22) MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - https://gerrit.wikimedia.org/r/769941 
(https://phabricator.wikimedia.org/T265117) [11:51:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P23083 and previous config saved to /var/cache/conftool/dbconfig/20220325-115156-ladsgroup.json [11:51:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:46] (CR) MVernon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) (owner: MVernon) [11:53:46] (CR) MVernon: swift: deploy swift_ring_manager to one node per cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: MVernon) [11:53:55] (CR) MVernon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: MVernon) [12:07:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23084 and previous config saved to /var/cache/conftool/dbconfig/20220325-120701-ladsgroup.json [12:07:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [12:07:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [12:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:06] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [12:07:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23085 and previous 
config saved to /var/cache/conftool/dbconfig/20220325-120708-ladsgroup.json [12:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:18] (CR) Jbond: "So you accidentally ended up with me on you CR and now i have reviewed it :/" [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: MVernon) [12:16:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23086 and previous config saved to /var/cache/conftool/dbconfig/20220325-121623-ladsgroup.json [12:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:29] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [12:21:12] SRE, LDAP-Access-Requests: Grant Access to nda/logstash for User:TheDJ - https://phabricator.wikimedia.org/T304120 (jbond) Stalled→Resolved a:jbond thanks @KFrancis @TheDJ Access has been granted you should be able to access the requested resources now, please let me know if yu have any issues [12:22:47] SRE, Wikimedia-Mailing-lists: Email spam from varying tawk.email addresses - https://phabricator.wikimedia.org/T304390 (jbond) p:Triage→Medium [12:28:13] (CR) MarkAHershberger: admin: reactivate account for Mark Hershberger, add to Mediawiki releasers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/773660 (https://phabricator.wikimedia.org/T302287) (owner: Dzahn) [12:28:53] SRE, VPS-project-Codesearch: Add operations/software/purged to 
Codesearch - https://phabricator.wikimedia.org/T303434 (jbond) [12:31:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P23088 and previous config saved to /var/cache/conftool/dbconfig/20220325-123128-ladsgroup.json [12:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:23] SRE, VPS-project-Codesearch: Add operations/software/purged to Codesearch - https://phabricator.wikimedia.org/T303434 (jbond) Im not too familiar with code search so not sure wht does and doesn't make senses but tagging a few project owners @joe pcc @Volans anything extra you can think of e.g. cumin, deb... [12:43:50] SRE, VPS-project-Codesearch: Add operations/software/purged to Codesearch - https://phabricator.wikimedia.org/T303434 (jbond) p:Triage→Medium [12:46:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P23089 and previous config saved to /var/cache/conftool/dbconfig/20220325-124633-ladsgroup.json [12:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:08] !log Updated operations/dumps/dcat on snapshot10(08|09|11|12|13) from d4886f6 to a1f46e4 [12:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:34] (CR) Hoo man: "I just deployed this change it should take effect whenever dcat.rdf is re-generated next (probably early next week)." [dumps/dcat] - https://gerrit.wikimedia.org/r/773490 (owner: Abbe98) [12:51:23] SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (MoritzMuehlenhoff) >>! In T297913#7805533, @RobH wrote: > Also unable to determine how to poll for virtual disk IDs, other htan dropping into raid bios, which won't work out for production. I need to ke... 
[12:59:04] (PS2) Muehlenhoff: mediabackup::storage: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/771560 [12:59:40] SRE, VPS-project-Codesearch, Patch-For-Review: Add operations/software/purged to Codesearch - https://phabricator.wikimedia.org/T303434 (jbond) moritz suggested we should just add all software we maintain so ill create a cr to do that [13:00:15] (PS3) BBlack: geodns: remove geo-maps-esams-offline hack [dns] - https://gerrit.wikimedia.org/r/771631 (https://phabricator.wikimedia.org/T304089) [13:00:17] (PS4) BBlack: geodns: add drmrs fallback for esams to whole map [dns] - https://gerrit.wikimedia.org/r/771632 (https://phabricator.wikimedia.org/T304089) [13:01:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23090 and previous config saved to /var/cache/conftool/dbconfig/20220325-130138-ladsgroup.json [13:01:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [13:01:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [13:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:46] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [13:01:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23091 and previous config saved to /var/cache/conftool/dbconfig/20220325-130146-ladsgroup.json [13:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:52] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:08] SRE, VPS-project-Codesearch, Patch-For-Review: Add operations/software/purged to Codesearch - https://phabricator.wikimedia.org/T303434 (Joe) Other things to add that are not under `operations/software`: * `operations/docker-images/docker-pkg` * `operations/docker-images/docker-report` * `operations... [13:08:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P23092 and previous config saved to /var/cache/conftool/dbconfig/20220325-130834-root.json [13:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:44] (PS1) Muehlenhoff: klaxone: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/773772 [13:17:21] (CR) jerkins-bot: [V: -1] klaxone: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/773772 (owner: Muehlenhoff) [13:20:57] SRE, Data-Persistence-Backup, media-backups, Goal, Patch-For-Review: Document media recovery use case proposals and decide their priority - https://phabricator.wikimedia.org/T299764 (jcrespo) Because performing backups takes multiple days, the following issues have been detected: * Some file... [13:22:55] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) restart workers for Hadoop analytics cluster: Roll restart of jvm daemons for openjdk upgrade. 
[13:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P23093 and previous config saved to /var/cache/conftool/dbconfig/20220325-132338-root.json [13:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23094 and previous config saved to /var/cache/conftool/dbconfig/20220325-132746-ladsgroup.json [13:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:53] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [13:27:55] (03CR) 10MVernon: swift: deploy swift_ring_manager to one node per cluster (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [13:28:57] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-releng-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:55] RECOVERY - SSH on thumbor2004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:34:36] (03PS2) 10Muehlenhoff: klaxone: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/773772 [13:35:14] (03CR) 10jerkins-bot: [V: 04-1] klaxone: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/773772 (owner: 10Muehlenhoff) [13:37:29] (03CR) 10RhinosF1: klaxone: Switch to 
systemd::sysuser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/773772 (owner: 10Muehlenhoff) [13:38:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P23095 and previous config saved to /var/cache/conftool/dbconfig/20220325-133842-root.json [13:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:35] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:42:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P23096 and previous config saved to /var/cache/conftool/dbconfig/20220325-134251-ladsgroup.json [13:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:03] RECOVERY - Host ms-be1070 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [13:46:38] (03PS3) 10Muehlenhoff: klaxon: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/773772 [13:48:02] (03PS1) 10Phedenskog: Add marcusolsson-dynamic-text plugin. 
[debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/773778 (https://phabricator.wikimedia.org/T304587) [13:49:59] (03CR) 10Muehlenhoff: klaxon: Switch to systemd::sysuser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/773772 (owner: 10Muehlenhoff) [13:50:08] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/773772 (owner: 10Muehlenhoff) [13:53:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P23097 and previous config saved to /var/cache/conftool/dbconfig/20220325-135346-root.json [13:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:47] (03CR) 10CDanis: [C: 03+1] klaxon: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/773772 (owner: 10Muehlenhoff) [13:57:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P23098 and previous config saved to /var/cache/conftool/dbconfig/20220325-135756-ladsgroup.json [13:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:31] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10cmooney) Just an update here. Juniper have been able to confirm that this is a bug in their implementation of ARP on this platform. TL;DR what happens on a... 
[14:02:32] (03PS1) 10Muehlenhoff: certspotter: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/773780 [14:03:06] (03CR) 10jerkins-bot: [V: 04-1] certspotter: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/773780 (owner: 10Muehlenhoff) [14:04:08] (03PS2) 10Muehlenhoff: certspotter: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/773780 [14:07:36] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/773780 (owner: 10Muehlenhoff) [14:08:42] (03PS1) 10Jelto: gitlab: add version check to restore script [puppet] - 10https://gerrit.wikimedia.org/r/773783 (https://phabricator.wikimedia.org/T274463) [14:08:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P23099 and previous config saved to /var/cache/conftool/dbconfig/20220325-140850-root.json [14:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:31] (03CR) 10Hashar: docker: move pruning to new profile docker::prune (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/773641 (https://phabricator.wikimedia.org/T304644) (owner: 10Razzi) [14:10:56] (03CR) 10Filippo Giunchedi: [C: 03+1] Add marcusolsson-dynamic-text plugin. 
[debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/773778 (https://phabricator.wikimedia.org/T304587) (owner: 10Phedenskog) [14:10:57] RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.15 ms [14:11:26] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34563/console" [puppet] - 10https://gerrit.wikimedia.org/r/773783 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [14:13:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23100 and previous config saved to /var/cache/conftool/dbconfig/20220325-141301-ladsgroup.json [14:13:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [14:13:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [14:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:06] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [14:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:45] (03PS6) 10Hashar: docker: move pruning to new profile docker::prune [puppet] - 10https://gerrit.wikimedia.org/r/773641 (https://phabricator.wikimedia.org/T304644) (owner: 10Razzi) [14:19:23] (03CR) 10Hashar: docker: move pruning to new profile docker::prune (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/773641 (https://phabricator.wikimedia.org/T304644) (owner: 10Razzi) 
[14:24:45] 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10Infrastructure-Foundations, 10SRE Observability (FY2021/2022-Q3): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10CDanis) [[ https://status.wikimedia.org | status.wikimedia.org ]] is now up-to-date... [14:26:48] (03CR) 10David Caro: [C: 03+2] wmcs.backy2: add link to the runbook for backup_vms [puppet] - 10https://gerrit.wikimedia.org/r/772839 (https://phabricator.wikimedia.org/T304408) (owner: 10David Caro) [14:27:19] (03PS1) 10Hashar: ci: docker system prune on ci::master [puppet] - 10https://gerrit.wikimedia.org/r/773784 [14:27:41] (03CR) 10Hashar: "For the production hosts:" [puppet] - 10https://gerrit.wikimedia.org/r/773641 (https://phabricator.wikimedia.org/T304644) (owner: 10Razzi) [14:28:22] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 (10jhathaway) a:03jhathaway [14:35:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [14:35:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [14:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23101 and previous config saved to /var/cache/conftool/dbconfig/20220325-143545-ladsgroup.json [14:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:53] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, 
user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [14:39:23] (03CR) 10Hashar: beta::autoupdater: Remove more obsolete stuff after scap prep auto (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/753787 (owner: 10Ahmon Dancy) [15:01:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23107 and previous config saved to /var/cache/conftool/dbconfig/20220325-150141-ladsgroup.json [15:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:48] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [15:07:40] (03PS1) 10Elukey: aptrepo: add component for istio 1.9.5 [puppet] - 10https://gerrit.wikimedia.org/r/773791 (https://phabricator.wikimedia.org/T297612) [15:10:02] (03CR) 10Elukey: [C: 03+2] aptrepo: add component for istio 1.9.5 [puppet] - 10https://gerrit.wikimedia.org/r/773791 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [15:13:58] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:14:57] (03PS1) 10Jbond: POC: P:thanos::swift::frontend: move ring manager config to hiera [puppet] - 10https://gerrit.wikimedia.org/r/773794 [15:14:59] (03PS1) 10Jbond: P:thanos::swift: demo changing wieghts and draining [puppet] - 10https://gerrit.wikimedia.org/r/773795 [15:15:40] (03CR) 10jerkins-bot: [V: 04-1] POC: P:thanos::swift::frontend: move ring manager config to hiera [puppet] - 10https://gerrit.wikimedia.org/r/773794 (owner: 
10Jbond) [15:15:50] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:16:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P23108 and previous config saved to /var/cache/conftool/dbconfig/20220325-151647-ladsgroup.json [15:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:31] (03CR) 10Cwhite: [V: 03+2 C: 03+2] "Build succeeds. Installed on grafana-next.wm.o for testing." [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/773778 (https://phabricator.wikimedia.org/T304587) (owner: 10Phedenskog) [15:22:18] (03CR) 10Herron: [C: 03+1] "LGTM overall, one question inline" [alerts] - 10https://gerrit.wikimedia.org/r/773747 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [15:23:39] 10SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for User:TheDJ - https://phabricator.wikimedia.org/T304120 (10TheDJ) access confirmed [15:26:36] (03CR) 10Ayounsi: "Overall LGTM, some suggestions inline." 
[homer/public] - 10https://gerrit.wikimedia.org/r/773587 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [15:27:38] (03CR) 10David Caro: wmcs: toolforge: k8s: factorize build code into a class (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773510 (owner: 10Arturo Borrero Gonzalez) [15:30:25] (03CR) 10Jbond: swift: deploy swift_ring_manager to one node per cluster (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [15:30:29] (03PS1) 10Filippo Giunchedi: pontoon: use vendor_modules during bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/773799 [15:31:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P23109 and previous config saved to /var/cache/conftool/dbconfig/20220325-153152-ladsgroup.json [15:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:00] (JobUnavailable) firing: Reduced availability for job trafficserver in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:01] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: use vendor_modules during bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/773799 (owner: 10Filippo Giunchedi) [15:36:50] (03PS2) 10Jbond: POC: P:thanos::swift::frontend: move ring manager config to hiera [puppet] - 10https://gerrit.wikimedia.org/r/773794 [15:37:38] (03CR) 10jerkins-bot: [V: 04-1] POC: P:thanos::swift::frontend: move ring manager config to hiera [puppet] - 10https://gerrit.wikimedia.org/r/773794 (owner: 10Jbond) [15:38:40] (03CR) 10Elukey: [C: 03+1] "Had a chat with Janis about adding a note in the docs that the first tlsHostname will be used as CN, LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/773255 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [15:40:56] (03PS2) 10JMeybohm: Allow multiple tlsHostnames [deployment-charts] - 10https://gerrit.wikimedia.org/r/773255 (https://phabricator.wikimedia.org/T290966) [15:43:11] (03PS1) 10Btullis: Add an alert for zero messages being generated by varnishkafka instances [alerts] - 10https://gerrit.wikimedia.org/r/773801 (https://phabricator.wikimedia.org/T300246) [15:45:00] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3548 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [15:45:05] 10ops-codfw: Document codfw breakout patch pannels in Netbox - https://phabricator.wikimedia.org/T304710 (10ayounsi) p:05Triage→03Low [15:45:34] 10SRE, 10ops-codfw: Document codfw breakout patch pannels in Netbox - https://phabricator.wikimedia.org/T304710 (10ayounsi) [15:46:01] 10SRE, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Add linecard diversity to the router-to-router interconnect in codfw - https://phabricator.wikimedia.org/T248506 (10ayounsi) [15:46:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23110 and previous config saved to /var/cache/conftool/dbconfig/20220325-154658-ladsgroup.json [15:46:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [15:47:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [15:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:04] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, 
user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [15:47:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23111 and previous config saved to /var/cache/conftool/dbconfig/20220325-154705-ladsgroup.json [15:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:26] (03CR) 10Phuedx: "> We should definitely actually run the tests before we merge this as well" [puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) (owner: 10Phuedx) [15:49:51] (03CR) 10Ahmon Dancy: [C: 03+1] P:scap::dsh: Add scap targets as a dsh group [puppet] - 10https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559) (owner: 10Jbond) [15:50:14] (03CR) 10Ahmon Dancy: [C: 03+1] wmflib: add class_hosts [puppet] - 10https://gerrit.wikimedia.org/r/771437 (https://phabricator.wikimedia.org/T303559) (owner: 10Jbond) [15:50:50] (03CR) 10JMeybohm: [C: 03+2] Allow multiple tlsHostnames [deployment-charts] - 10https://gerrit.wikimedia.org/r/773255 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [15:51:07] (03CR) 10JMeybohm: [C: 03+2] Add correct tlsHostnames and extra SAN to datahub cert [deployment-charts] - 10https://gerrit.wikimedia.org/r/773256 (https://phabricator.wikimedia.org/T303049) (owner: 10JMeybohm) [15:51:14] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3387 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [15:52:42] (03PS2) 10Btullis: Add an 
alert for zero messages being generated by varnishkafka instances [alerts] - 10https://gerrit.wikimedia.org/r/773801 (https://phabricator.wikimedia.org/T300246) [15:53:27] 10ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10ayounsi) p:05Triage→03Medium [15:54:10] 10ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10ayounsi) [15:54:50] (03Merged) 10jenkins-bot: Allow multiple tlsHostnames [deployment-charts] - 10https://gerrit.wikimedia.org/r/773255 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [15:55:07] (03CR) 10Ahmon Dancy: "just typo nits. The change LGTM otherwise" [puppet] - 10https://gerrit.wikimedia.org/r/773784 (owner: 10Hashar) [15:56:56] (03CR) 10Dzahn: geoip::data::maxmind: deactivate timer for downloading of legacy DBs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/773648 (https://phabricator.wikimedia.org/T303464) (owner: 10Dzahn) [16:00:08] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4355 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:02:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:04:30] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.09677 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:06:37] 10SRE, 10ops-codfw: Document codfw breakout patch panels in Netbox - https://phabricator.wikimedia.org/T304710 (10ayounsi) [16:11:53] (03PS2) 10Majavah: toolforge: remove ingress-nginx manifests [puppet] - 10https://gerrit.wikimedia.org/r/773750 [16:12:12] (03CR) 10Majavah: toolforge: 
remove ingress-nginx manifests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/773750 (owner: 10Majavah) [16:12:37] (03CR) 10Elukey: [C: 03+2] sre.kafka.roll-restart-brokers: generalize the restart reason [cookbooks] - 10https://gerrit.wikimedia.org/r/773475 (owner: 10Elukey) [16:12:41] (03PS2) 10Elukey: sre.kafka.roll-restart-brokers: generalize the restart reason [cookbooks] - 10https://gerrit.wikimedia.org/r/773475 [16:12:44] (03CR) 10Elukey: [V: 03+2 C: 03+2] sre.kafka.roll-restart-brokers: generalize the restart reason [cookbooks] - 10https://gerrit.wikimedia.org/r/773475 (owner: 10Elukey) [16:12:51] (03PS1) 10JMeybohm: Allow to specify additional gatewayHosts without overriding the default [deployment-charts] - 10https://gerrit.wikimedia.org/r/773805 (https://phabricator.wikimedia.org/T290966) [16:13:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:13:21] (03PS2) 10Elukey: Add helmfile config for Istio proxy sidecars [deployment-charts] - 10https://gerrit.wikimedia.org/r/773565 (https://phabricator.wikimedia.org/T297612) [16:14:35] (03PS3) 10Jbond: POC: P:thanos::swift::frontend: move ring manager config to hiera [puppet] - 10https://gerrit.wikimedia.org/r/773794 [16:16:19] (03CR) 10Cwhite: [C: 03+1] sre: add ProbeDown paging alert for enabled services [alerts] - 10https://gerrit.wikimedia.org/r/773747 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [16:16:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23112 and previous config saved to /var/cache/conftool/dbconfig/20220325-161631-ladsgroup.json [16:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:37] T298565: Fix mismatching field type of user table for columns 
user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [16:19:55] (03PS1) 10Vivian Rook: Update codfw1dev cloudservices openstack [puppet] - 10https://gerrit.wikimedia.org/r/773806 (https://phabricator.wikimedia.org/T304702) [16:20:55] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] toolforge: remove ingress-nginx manifests [puppet] - 10https://gerrit.wikimedia.org/r/773750 (owner: 10Majavah) [16:21:57] (03PS2) 10Jbond: P:thanos::swift: demo changing wieghts and draining [puppet] - 10https://gerrit.wikimedia.org/r/773795 [16:23:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: cp1090.mgmt ssh port not accessible - https://phabricator.wikimedia.org/T304589 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson re-seated the mgmt cable. no issues logging into mgmt interface root@cp1090.mgmt.eqiad.wmnet's password: /admin1-> [16:23:45] (03CR) 10Andrew Bogott: [C: 03+1] "I'm all for moving things out of the monolithic puppet gerrit repo whenever possible. Thanks taavi!" [puppet] - 10https://gerrit.wikimedia.org/r/773750 (owner: 10Majavah) [16:23:50] 10SRE, 10Wikimedia-Etherpad, 10serviceops: Etherpads corrupted - https://phabricator.wikimedia.org/T304005 (10Dzahn) a:03Dzahn [16:24:06] 10SRE, 10ops-eqiad, 10serviceops: mc1053 PS redundancy alert - https://phabricator.wikimedia.org/T304477 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson Fixed [16:24:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: remove ingress-nginx manifests [puppet] - 10https://gerrit.wikimedia.org/r/773750 (owner: 10Majavah) [16:27:03] (03CR) 10Jcrespo: "Ok to me, I trust your suggestion this is better. 
:-)" [puppet] - 10https://gerrit.wikimedia.org/r/771560 (owner: 10Muehlenhoff) [16:29:11] 10SRE, 10Infrastructure-Foundations, 10Release-Engineering-Team (🚂🧪 Trainsperiment Week): Need a service account on deploy servers - https://phabricator.wikimedia.org/T303857 (10thcipriani) > Approving manager: @thcipriani Approved from my side! Tagging #SRE as well — not sure about the current best task i... [16:31:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P23114 and previous config saved to /var/cache/conftool/dbconfig/20220325-163136-ladsgroup.json [16:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:34] (03CR) 10Andrew Bogott: [C: 03+1] Update codfw1dev cloudservices openstack [puppet] - 10https://gerrit.wikimedia.org/r/773806 (https://phabricator.wikimedia.org/T304702) (owner: 10Vivian Rook) [16:34:26] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:29] (03PS8) 10Giuseppe Lavagetto: Introduce requestctl [software/conftool] - 10https://gerrit.wikimedia.org/r/772342 (https://phabricator.wikimedia.org/T302471) [16:34:31] (03PS2) 10Giuseppe Lavagetto: Add debian packaging for requestctl [software/conftool] - 10https://gerrit.wikimedia.org/r/773760 [16:35:18] (03CR) 10jerkins-bot: [V: 04-1] Introduce requestctl [software/conftool] - 10https://gerrit.wikimedia.org/r/772342 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [16:35:20] (03CR) 10jerkins-bot: [V: 04-1] Add debian packaging for requestctl [software/conftool] - 10https://gerrit.wikimedia.org/r/773760 (owner: 10Giuseppe Lavagetto) [16:36:09] (03CR) 10Giuseppe Lavagetto: Introduce requestctl (033 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/772342 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [16:36:41] 10SRE, 
10SRE-swift-storage, 10ops-eqiad, 10decommission-hardware: Decommission ms-fe100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T304064 (10Cmjohnson) [16:37:12] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10decommission-hardware: Decommission ms-fe100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T304064 (10Cmjohnson) 05Open→03Resolved Removed from rack and netbox updated [16:37:29] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:38] (03CR) 10Jcrespo: mediabackup::storage: Switch to systemd::sysuser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/771560 (owner: 10Muehlenhoff) [16:39:40] (03PS9) 10Giuseppe Lavagetto: Introduce requestctl [software/conftool] - 10https://gerrit.wikimedia.org/r/772342 (https://phabricator.wikimedia.org/T302471) [16:39:42] (03PS3) 10Giuseppe Lavagetto: Add debian packaging for requestctl [software/conftool] - 10https://gerrit.wikimedia.org/r/773760 [16:40:02] RECOVERY - IPMI Sensor Status on mc1053 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:41:56] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:43:28] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Release-Engineering-Team (🚂🧪 Trainsperiment Week): Need a service account on deploy servers - https://phabricator.wikimedia.org/T303857 (10jcrespo) Normally this kind of tasks would be routed by the person in clinic duty, but I just happened to see him g... 
[16:43:59] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Introduce requestctl [software/conftool] - 10https://gerrit.wikimedia.org/r/772342 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [16:44:14] 10SRE, 10Wikimedia-Mailing-lists: Mailman3: 550-Support for list subscription via email has been disabled. - https://phabricator.wikimedia.org/T303888 (10Urbanecm) >>! In T303888#7805113, @Ladsgroup wrote: > Yup, this is something we carried over from mailman2 given the history of abuse with mass subscription... [16:45:21] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Release-Engineering-Team (🚂🧪 Trainsperiment Week): Need a service account on deploy servers - https://phabricator.wikimedia.org/T303857 (10Dzahn) [16:45:48] (03Merged) 10jenkins-bot: Introduce requestctl [software/conftool] - 10https://gerrit.wikimedia.org/r/772342 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [16:46:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P23115 and previous config saved to /var/cache/conftool/dbconfig/20220325-164641-ladsgroup.json [16:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:53] 10SRE, 10Maps: Allow Wikimedia Maps usage on bbcrewind.co.uk - https://phabricator.wikimedia.org/T297968 (10JMinor) a:05MSantos→03JMinor Looks like were set. Just need to close the loop with the BBC folks. Will resolve when confirmed. Thank you! [16:48:34] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Release-Engineering-Team (🚂🧪 Trainsperiment Week): Need a service account on deploy servers - https://phabricator.wikimedia.org/T303857 (10Joe) a:03Joe I'll take care of this, as I assume it's not urgent to be completed before... 
[16:49:06] PROBLEM - Host ml-cache1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:49:47] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add debian packaging for requestctl [software/conftool] - 10https://gerrit.wikimedia.org/r/773760 (owner: 10Giuseppe Lavagetto) [16:50:44] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:09] (03Merged) 10jenkins-bot: Add debian packaging for requestctl [software/conftool] - 10https://gerrit.wikimedia.org/r/773760 (owner: 10Giuseppe Lavagetto) [16:53:36] (03CR) 10Dzahn: [C: 03+2] admin: reactivate account for Mark Hershberger, add to Mediawiki releasers [puppet] - 10https://gerrit.wikimedia.org/r/773660 (https://phabricator.wikimedia.org/T302287) (owner: 10Dzahn) [16:57:09] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to releaser for MarkAHershberger - https://phabricator.wikimedia.org/T302287 (10Dzahn) ` [releases1002:~] $ id mah uid=1232(mah) gid=500(wikidev) groups=500(wikidev),711(releasers-mediawiki) [releases2002:~] $ id mah uid=1232(mah) gid=500(w... [16:57:43] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10Cmjohnson) @elukey I moved ml-cache1002 to row/rack C4. [16:58:55] (03CR) 10Phuedx: Request high-entropy Sec-CH-UA* client hints (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) (owner: 10Phuedx) [17:00:10] (03CR) 10Jcrespo: [C: 03+2] "Bah, no worth time discussing, let's just deploy as it is." 
[puppet] - 10https://gerrit.wikimedia.org/r/771560 (owner: 10Muehlenhoff) [17:00:33] (03CR) 10Jbond: [V: 03+1 C: 03+2] wmflib: add class_hosts [puppet] - 10https://gerrit.wikimedia.org/r/771437 (https://phabricator.wikimedia.org/T303559) (owner: 10Jbond) [17:01:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23116 and previous config saved to /var/cache/conftool/dbconfig/20220325-170146-ladsgroup.json [17:01:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [17:01:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [17:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:52] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [17:01:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23117 and previous config saved to /var/cache/conftool/dbconfig/20220325-170154-ladsgroup.json [17:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:54] (03Abandoned) 10Majavah: toolforge: deploy ingress-nginx via helmfile and provide deploy.sh [puppet] - 10https://gerrit.wikimedia.org/r/773448 (https://phabricator.wikimedia.org/T303931) (owner: 10Majavah) [17:09:27] (03PS8) 10Phuedx: Request high-entropy Sec-CH-UA* client hints [puppet] - 
10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) [17:10:47] (03PS1) 10Andrew Bogott: Move cloudstore1008/1009 to role::spare [puppet] - 10https://gerrit.wikimedia.org/r/773819 (https://phabricator.wikimedia.org/T291405) [17:14:16] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache1002.eqiad.wmnet with OS bullseye [17:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:21] (03CR) 10Andrew Bogott: [C: 03+2] Move cloudstore1008/1009 to role::spare [puppet] - 10https://gerrit.wikimedia.org/r/773819 (https://phabricator.wikimedia.org/T291405) (owner: 10Andrew Bogott) [17:14:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-cache1002.eqiad.wmnet with OS bullseye [17:14:36] RECOVERY - Host ml-cache1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [17:16:08] (03PS1) 10Majavah: hieradata: generate cfssl certs for cloudmetrics* [puppet] - 10https://gerrit.wikimedia.org/r/773821 [17:16:44] (03PS9) 10Phuedx: Request high-entropy Sec-CH-UA* client hints [puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) [17:17:47] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34566/console" [puppet] - 10https://gerrit.wikimedia.org/r/773821 (owner: 10Majavah) [17:19:34] (03CR) 10Andrew Bogott: [C: 03+2] hieradata: generate cfssl certs for cloudmetrics* [puppet] - 10https://gerrit.wikimedia.org/r/773821 (owner: 10Majavah) [17:28:08] (03CR) 10BBlack: [C: 03+1] "Looks great now! Testsuite caught a couple of minor syntax issues fixed in PS8 and PS9, all clean on both text and upload runs now." 
[puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) (owner: 10Phuedx) [17:29:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23118 and previous config saved to /var/cache/conftool/dbconfig/20220325-172916-ladsgroup.json [17:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:21] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [17:32:33] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Release-Engineering-Team (🚂🧪 Trainsperiment Week): Need a service account on deploy servers - https://phabricator.wikimedia.org/T303857 (10dancy) Hi @Joe. This request is not urgent so it can wait until next week. The plan is f... 
[17:32:43] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 121 probes of 675 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:35:21] PROBLEM - SSH on thumbor2004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:38:15] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 60 probes of 675 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:42:42] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-cache1002.eqiad.wmnet with OS bullseye [17:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-cache1002.eqiad.wmnet with OS bullseye executed wit... 
[17:44:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P23119 and previous config saved to /var/cache/conftool/dbconfig/20220325-174421-ladsgroup.json [17:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:16] (03PS3) 10Elukey: Add helmfile config for Istio proxy sidecars [deployment-charts] - 10https://gerrit.wikimedia.org/r/773565 (https://phabricator.wikimedia.org/T297612) [17:53:01] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:59:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P23120 and previous config saved to /var/cache/conftool/dbconfig/20220325-175926-ladsgroup.json [17:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:53] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.69 ms [18:14:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23121 and previous config saved to /var/cache/conftool/dbconfig/20220325-181431-ladsgroup.json [18:14:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [18:14:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [18:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:38] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:14:39] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Depooling db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23122 and previous config saved to /var/cache/conftool/dbconfig/20220325-181439-ladsgroup.json [18:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:39:11] 10SRE, 10SRE-Access-Requests: Requesting access to releaser for MarkAHershberger - https://phabricator.wikimedia.org/T302287 (10Dzahn) [18:39:21] 10SRE, 10SRE-Access-Requests: Requesting access to releaser for MarkAHershberger - https://phabricator.wikimedia.org/T302287 (10Dzahn) a:03Dzahn [18:40:37] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34570/console" [puppet] - 10https://gerrit.wikimedia.org/r/773806 (https://phabricator.wikimedia.org/T304702) (owner: 10Vivian Rook) [18:43:24] 10SRE, 10Patch-For-Review, 10Service-deployment-requests: New Service Request miscweb - https://phabricator.wikimedia.org/T281538 (10Dzahn) [18:44:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23123 and previous config saved to /var/cache/conftool/dbconfig/20220325-184406-ladsgroup.json [18:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:11] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - 
https://phabricator.wikimedia.org/T298565 [18:51:53] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:59:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P23124 and previous config saved to /var/cache/conftool/dbconfig/20220325-185911-ladsgroup.json [18:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:09] (03PS1) 10Dzahn: dumps: add description for Bugzilla HTML dump file [puppet] - 10https://gerrit.wikimedia.org/r/773832 (https://phabricator.wikimedia.org/T284193) [19:02:14] (03CR) 10Dzahn: [C: 03+2] "just a few words of description in HTML" [puppet] - 10https://gerrit.wikimedia.org/r/773832 (https://phabricator.wikimedia.org/T284193) (owner: 10Dzahn) [19:10:10] !log copying dump from deploy server to dumps server: scp -3 deploy1002.eqiad.wmnet:/srv/miscweb/static-bugzilla.tar.gz labstore1006.wikimedia.org:~ (T284193) [19:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:16] T284193: put static-bugzilla HTML dump on dumps servers - https://phabricator.wikimedia.org/T284193 [19:10:34] 10SRE, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Quiddity) @Ottomata Ping in case this fresh example helps. It's unclear from the last engineer comment above (Petr's at T93049#659... 
[19:14:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P23125 and previous config saved to /var/cache/conftool/dbconfig/20220325-191416-ladsgroup.json [19:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23126 and previous config saved to /var/cache/conftool/dbconfig/20220325-192923-ladsgroup.json [19:29:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [19:29:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [19:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:30] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:00] (JobUnavailable) firing: Reduced availability for job trafficserver in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:36:51] RECOVERY - SSH on thumbor2004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:51:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 
db1170.eqiad.wmnet with reason: Maintenance [19:51:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [19:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23127 and previous config saved to /var/cache/conftool/dbconfig/20220325-195137-ladsgroup.json [19:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:44] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:56:58] !Log deploy1002 - removing /srv/miscweb and files inside it, moved to dumps, was only needed temporary, meanwhile inside the container repo for k8s as well (T284193), cleaning up deploy1002 [19:56:58] T284193: put static-bugzilla HTML dump on dumps servers - https://phabricator.wikimedia.org/T284193 [19:58:20] 10SRE, 10Patch-For-Review, 10Service-deployment-requests: New Service Request miscweb - https://phabricator.wikimedia.org/T281538 (10Dzahn) [20:05:36] (03CR) 10JMeybohm: [C: 04-1] "I'd argue to set `istio_sidecar_proxy: true` for the ml-clusters in this patch as well to have the new helmfile rendered (and the diff vis" [deployment-charts] - 10https://gerrit.wikimedia.org/r/773565 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [20:06:24] (03PS1) 10Dzahn: puppetmaster:geoip: stop trying to download GeoIP1 legacy databases [puppet] - 10https://gerrit.wikimedia.org/r/773843 (https://phabricator.wikimedia.org/T303464) [20:09:12] away [20:10:39] (03PS1) 
10Dzahn: geoip::maxmind: remove code for absenting old resources [puppet] - 10https://gerrit.wikimedia.org/r/773844 (https://phabricator.wikimedia.org/T303464) [20:15:14] (03PS1) 10Dzahn: geoip::maxmind: rename the legacy timer to geoip2 [puppet] - 10https://gerrit.wikimedia.org/r/773845 (https://phabricator.wikimedia.org/T303464) [20:16:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23128 and previous config saved to /var/cache/conftool/dbconfig/20220325-201613-ladsgroup.json [20:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:19] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:31:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P23129 and previous config saved to /var/cache/conftool/dbconfig/20220325-203118-ladsgroup.json [20:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:01] 10SRE, 10Wikimedia-Etherpad, 10serviceops: Etherpads corrupted - https://phabricator.wikimedia.org/T304005 (10Dzahn) Hello @Zapipedia-WMF so I think what happened here is, the first case was likely caused by me doing the maintenance because I had the service running from 2 servers at the same time. I knew... 
[20:46:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P23130 and previous config saved to /var/cache/conftool/dbconfig/20220325-204623-ladsgroup.json [20:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:13] 10SRE, 10Wikimedia-Mailing-lists: Email spam from varying tawk.email addresses - https://phabricator.wikimedia.org/T304390 (10Quiddity) That resulted in an error message: ` An error occurred: Invalid Parameter "email": Expected a valid email address or regular expression, got .+\.tawk\.email$. ` Looking at the... [20:59:29] (03PS2) 10JMeybohm: Allow to specify additional gatewayHosts without overriding the default [deployment-charts] - 10https://gerrit.wikimedia.org/r/773805 (https://phabricator.wikimedia.org/T290966) [21:01:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23131 and previous config saved to /var/cache/conftool/dbconfig/20220325-210128-ladsgroup.json [21:01:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [21:01:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [21:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:34] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:01:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23132 and previous config saved to 
/var/cache/conftool/dbconfig/20220325-210136-ladsgroup.json [21:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:52] 10SRE, 10Wikimedia-Etherpad, 10serviceops: Etherpads corrupted - https://phabricator.wikimedia.org/T304005 (10Dzahn) I documented this at https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org#Restoring_a_pad_to_a_previous_revision [21:03:01] 10SRE, 10Wikimedia-Etherpad, 10serviceops: Etherpads corrupted - https://phabricator.wikimedia.org/T304005 (10Dzahn) 05Open→03Resolved claiming resolved, let me know if you agree [21:03:44] (03CR) 10JMeybohm: "I think we're almost good to go. I've another small patch to scaffolding and _ingress_helper that I would like to rebase this on, though: " [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [21:06:22] 10SRE, 10Data-Engineering, 10Traffic, 10Trust-and-Safety, and 2 others: Disable GeoIP Legacy Download - https://phabricator.wikimedia.org/T303464 (10Dzahn) a:03Dzahn [21:08:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23133 and previous config saved to /var/cache/conftool/dbconfig/20220325-210831-ladsgroup.json [21:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:38] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:18:36] (03Abandoned) 10Dzahn: geoip::data::maxmind: deactivate timer for downloading of legacy DBs [puppet] - 
10https://gerrit.wikimedia.org/r/773648 (https://phabricator.wikimedia.org/T303464) (owner: 10Dzahn) [21:21:36] (03CR) 10Dzahn: [C: 03+2] zuul: stop keeping reflog on the mergers [puppet] - 10https://gerrit.wikimedia.org/r/757943 (owner: 10Hashar) [21:23:04] mutante: oh thanks, I kind of forgot about those zuul-merger git settings :D [21:23:33] hashar: no problem, same here but backlog after vacation. [21:23:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P23134 and previous config saved to /var/cache/conftool/dbconfig/20220325-212336-ladsgroup.json [21:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:01] hashar: I assume I don't need to go to an actual merger now [21:24:23] I did compare git::userconfig to another case, git option makes sense [21:24:40] yeah [21:24:46] ok, cool [21:24:56] (03PS3) 10Hashar: zuul: prune heads and tags on each fetches [puppet] - 10https://gerrit.wikimedia.org/r/757944 (https://phabricator.wikimedia.org/T220606) [21:25:19] mutante: and there is a 2nd one which I have rebased https://gerrit.wikimedia.org/r/c/operations/puppet/+/757944/ :] [21:25:42] to prune branches and tags when fetching from Gerrit [21:25:59] else deleted tags stay behind on the zuul-merger git repos [21:26:03] yea, I saw that.. well..since you are here.. let's do that too [21:26:12] the previous one seemed safer [21:26:12] and the branches keep accumulating, notably the wmf/* branches for mediawiki repos :] [21:26:21] ack [21:26:41] I have manually pruned the branches and tags a few weeks ago [21:26:51] (03CR) 10Dzahn: [C: 03+2] zuul: prune heads and tags on each fetches [puppet] - 10https://gerrit.wikimedia.org/r/757944 (https://phabricator.wikimedia.org/T220606) (owner: 10Hashar) [21:27:01] great. 
here we go [21:27:05] I should pay more attention to the patch I send for review [21:27:16] (03CR) 10Krinkle: Relax CSP rules for taint-check-demo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/680337 (https://phabricator.wikimedia.org/T257301) (owner: 10Daimona Eaytoy) [21:27:38] hashar: merged on puppetmaster. wanna check in cloud? [21:28:24] they run on contint2001 and contint1001, I am checking puppet [21:29:10] oh, zuul::merger, of course. doing that too [21:29:21] $ sudo -H -u zuul git config --list [21:29:21] protocol.version=2 [21:29:21] core.logallrefupdates=false [21:29:21] fetch.prune=true [21:29:21] fetch.prunetags=true [21:30:12] ack:) looks good [21:31:32] yeah that will make the git operations slightly faster [21:31:45] will close the tasks on Monday after I have verified [21:31:54] tells performance team, hehe [21:32:07] cool, have a good weekend then [21:33:44] :] [21:33:50] thank you very much, have a merry weekend [21:34:14] you're welcome [21:36:53] (03PS3) 10Dzahn: aptrepo: import gitlab-runner package for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/767604 (https://phabricator.wikimedia.org/T297659) [21:37:51] (03CR) 10Dzahn: "Is this what you meant, Moritz? Not exactly line 214 (because it's sorted alpha) but that must be the file you mean..
only that has "Up" [puppet] - 10https://gerrit.wikimedia.org/r/767604 (https://phabricator.wikimedia.org/T297659) (owner: 10Dzahn) [21:38:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P23135 and previous config saved to /var/cache/conftool/dbconfig/20220325-213841-ladsgroup.json [21:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:05] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:53:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23136 and previous config saved to /var/cache/conftool/dbconfig/20220325-215346-ladsgroup.json [21:53:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [21:53:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [21:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [21:53:52] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance 
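The manual `git config --list` check pasted at 21:29:21 could be automated along these lines. This is only a sketch: the helper names `parse_git_config_list` and `missing_or_wrong` are hypothetical, and the expected values are simply the four settings pasted in the channel, not an authoritative list of what puppet manages for the zuul mergers.

```python
# Sketch only: helper names are hypothetical. EXPECTED is copied from the
# `git config --list` output pasted in the channel at 21:29:21.
EXPECTED = {
    "protocol.version": "2",
    "core.logallrefupdates": "false",
    "fetch.prune": "true",
    "fetch.prunetags": "true",
}


def parse_git_config_list(output):
    """Turn `git config --list` output (one key=value per line) into a dict."""
    settings = {}
    for line in output.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank lines and keys printed without a value
        key, value = line.split("=", 1)
        settings[key] = value  # a later entry overrides an earlier one
    return settings


def missing_or_wrong(settings, expected=EXPECTED):
    """Map each mismatching key to a (wanted, found) pair; found is None if absent."""
    return {
        key: (wanted, settings.get(key))
        for key, wanted in expected.items()
        if settings.get(key) != wanted
    }


# The four lines as they appeared in the channel.
PASTED = """\
protocol.version=2
core.logallrefupdates=false
fetch.prune=true
fetch.prunetags=true
"""

if __name__ == "__main__":
    problems = missing_or_wrong(parse_git_config_list(PASTED))
    print(problems if problems else "all expected settings present")
```

On a real merger the input would come from running the same command as in the log (for example through `subprocess.run`) rather than from a pasted string.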
[21:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23137 and previous config saved to /var/cache/conftool/dbconfig/20220325-215400-ladsgroup.json [21:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:38] (03PS1) 10Krinkle: [WIP] wgKartographerStaticMapframe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773883 [22:07:09] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:13:49] (03PS2) 10Krinkle: [WIP] wgKartographerStaticMapframe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773883 [22:16:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:16:07] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4355 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [22:18:10] (03PS3) 10Krinkle: [WIP] wgKartographerStaticMapframe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773883 [22:19:10] (03PS4) 10Krinkle: [WIP] wgKartographerStaticMapframe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773883 [22:20:08] (03PS5) 10Krinkle: [WIP] wgKartographerStaticMapframe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773883 [22:20:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298565)', diff saved to 
https://phabricator.wikimedia.org/P23138 and previous config saved to /var/cache/conftool/dbconfig/20220325-222025-ladsgroup.json [22:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:30] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:20:35] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.04839 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [22:35:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P23139 and previous config saved to /var/cache/conftool/dbconfig/20220325-223530-ladsgroup.json [22:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:17] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3871 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [22:50:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P23140 and previous config saved to /var/cache/conftool/dbconfig/20220325-225035-ladsgroup.json [22:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:13] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3226 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [23:05:40] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23141 and previous config saved to /var/cache/conftool/dbconfig/20220325-230540-ladsgroup.json [23:05:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [23:05:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [23:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:46] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:24] (03PS6) 10Krinkle: List Kartographer static map exemptions and document+flip default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773883 (https://phabricator.wikimedia.org/T291736) [23:30:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [23:30:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [23:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:00] (JobUnavailable) firing: Reduced availability for job trafficserver in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:34:03] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3387 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [23:36:28] (CR) Awight: [C: +1] "Seems like a good idea, and safe to experiment with. Is there any monitoring that we can use to measure the improvement? Or maybe not be" [mediawiki-config] - https://gerrit.wikimedia.org/r/773883 (https://phabricator.wikimedia.org/T291736) (owner: Krinkle) [23:42:19] (CR) Krinkle: List Kartographer static map exemptions and document+flip default (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/773883 (https://phabricator.wikimedia.org/T291736) (owner: Krinkle) [23:43:01] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.08065 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [23:53:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [23:53:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [23:53:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [23:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [23:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [23:58:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [23:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23142 and previous config saved to /var/cache/conftool/dbconfig/20220325-235855-ladsgroup.json [23:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:00] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565