[00:47:19] PROBLEM - Check systemd state on apifeatureusage1001 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_apifeatureusage_codfw.service,curator_actions_apifeatureusage_eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:47:29] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:40:30] (JobUnavailable) firing: (3) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[02:08:39] SRE, DNS, Traffic: Central and South American countries in geo-maps - https://phabricator.wikimedia.org/T301605 (RLazarus)
[02:28:32] SRE-Access-Requests: Requesting access to analytics-privatedata-users for rzl - https://phabricator.wikimedia.org/T301606 (RLazarus)
[02:41:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T300775)', diff saved to https://phabricator.wikimedia.org/P20613 and previous config saved to /var/cache/conftool/dbconfig/20220212-024155-marostegui.json
[02:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:42:02] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[02:57:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P20614 and previous config saved to /var/cache/conftool/dbconfig/20220212-025700-marostegui.json
[02:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:10:12] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:12:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P20615 and previous config saved to /var/cache/conftool/dbconfig/20220212-031205-marostegui.json
[03:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:27:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T300775)', diff saved to https://phabricator.wikimedia.org/P20616 and previous config saved to /var/cache/conftool/dbconfig/20220212-032710-marostegui.json
[03:27:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[03:27:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[03:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:27:15] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[03:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:11:31] RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:42:35] PROBLEM - Disk space on krb1001 is CRITICAL: DISK CRITICAL - free space: / 1051 MB (2% inode=97%): /tmp 1051 MB (2% inode=97%): /var/tmp 1051 MB (2% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=krb1001&var-datasource=eqiad+prometheus/ops
[05:14:25] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 85 probes of 652 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:20:49] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 56 probes of 652 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:22:33] SRE, LDAP-Access-Requests: Grant Access to gerritadmin for Taavi Väänänen (Majavah) - https://phabricator.wikimedia.org/T301579 (Legoktm) >>! In T301579#7704673, @thcipriani wrote: > I'm in favor of this, but I'd like feedback of the other active gerritadmins as to whether they need more help. I don't t...
[05:41:05] (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[05:53:37] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:23:37] !log restarting blazegraph on wdqs1004 (jvm stuck for 4hours)
[07:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220212T0800)
[08:41:03] (CR) Jelto: [C: +2] aptrepo::files::updates Update gitlab-ce and gitlab-runner to 14.7 [puppet] - https://gerrit.wikimedia.org/r/761917 (https://phabricator.wikimedia.org/T300967) (owner: Jelto)
[08:49:51] !log truncate /var/log/auth.log to 1g on krb1001 to free space on root partition (original log saved under /srv)
[08:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:58] I'll open a task on monday for this --^
[08:58:09] RECOVERY - Disk space on krb1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=krb1001&var-datasource=eqiad+prometheus/ops
[09:41:05] (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[09:41:18] !log update gitlab2001 to gitlab-ce 14.7.2-ce.0
[09:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:50] !log update gitlab1001 to gitlab-ce 14.7.2-ce.0
[09:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:38] !log update gitlab-runner1001 and gitlab-runner2001 to gitlab-runner 14.7.0
[10:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:07] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:59:14] (CR) BBlack: [C: +1] wikimedia-dns.org: add AAAA records for Wikidough [dns] - https://gerrit.wikimedia.org/r/761363 (https://phabricator.wikimedia.org/T301165) (owner: Ssingh)
[11:21:43] PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:10:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[12:10:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[12:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:01] RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:41:05] (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[14:53:59] SRE, Discovery, Infrastructure-Foundations, netops: Speed up network connections for Elastic hosts - https://phabricator.wikimedia.org/T301577 (cmooney) Jumbo frames only bring benefits under very specific conditions for particular workloads. What evaluation of traffic patterns has been done tha...
[15:35:39] (PS4) Daimona Eaytoy: Relax CSP rules for taint-check-demo [puppet] - https://gerrit.wikimedia.org/r/680337 (https://phabricator.wikimedia.org/T257301)
[15:35:43] (PS5) Daimona Eaytoy: Relax CSP rules for taint-check-demo [puppet] - https://gerrit.wikimedia.org/r/680337 (https://phabricator.wikimedia.org/T257301)
[15:37:45] (PS6) Daimona Eaytoy: Relax CSP rules for taint-check-demo [puppet] - https://gerrit.wikimedia.org/r/680337 (https://phabricator.wikimedia.org/T257301)
[16:01:12] (Abandoned) Zabe: Start writing to $wmgConfigDir the same value as to $wmfConfigDir [mediawiki-config] - https://gerrit.wikimedia.org/r/760570 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[16:03:02] (CR) Zabe: [C: +1] wmf-config: Use __DIR__ instead of re-using an unintended global [mediawiki-config] - https://gerrit.wikimedia.org/r/761963 (https://phabricator.wikimedia.org/T45956) (owner: Krinkle)
[16:04:01] PROBLEM - SSH on wtp1027.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:59:16] SRE, Phabricator, vm-requests: VM Request template (form 84) title doesn't make sense - https://phabricator.wikimedia.org/T301387 (Aklapper) That doesn't seem to be form 84?
[17:03:24] Puppet, Infrastructure-Foundations: update hiera order ii production environment - https://phabricator.wikimedia.org/T301349 (Aklapper)
[17:05:22] RECOVERY - SSH on wtp1027.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:12:39] SRE, Wikidata, wdwb-tech: Move dispatching of wikidata to a dedicated node - https://phabricator.wikimedia.org/T193733 (Bugreporter) is this task still valid?
[17:13:17] (PS1) Andrew Bogott: novaproxy: add redirects for wmfcloud.org and www.wmfcloud.org [puppet] - https://gerrit.wikimedia.org/r/762074 (https://phabricator.wikimedia.org/T301592)
[17:16:15] SRE, Wikidata, wdwb-tech: Move dispatching of wikidata to a dedicated node - https://phabricator.wikimedia.org/T193733 (Ladsgroup) Stalled→Declined No.
[17:23:50] (PS1) Andrew Bogott: Add parking domain for wmfcloud.org [dns] - https://gerrit.wikimedia.org/r/762075 (https://phabricator.wikimedia.org/T301592)
[17:24:53] (CR) jerkins-bot: [V: -1] Add parking domain for wmfcloud.org [dns] - https://gerrit.wikimedia.org/r/762075 (https://phabricator.wikimedia.org/T301592) (owner: Andrew Bogott)
[17:26:20] (PS2) Andrew Bogott: Add parking domain for wmfcloud.org [dns] - https://gerrit.wikimedia.org/r/762075 (https://phabricator.wikimedia.org/T301592)
[17:27:11] (CR) jerkins-bot: [V: -1] Add parking domain for wmfcloud.org [dns] - https://gerrit.wikimedia.org/r/762075 (https://phabricator.wikimedia.org/T301592) (owner: Andrew Bogott)
[17:29:23] (CR) Andrew Bogott: "@Valentin -- I definitely don't know what I'm doing here, but I'm hoping to add slightly-helpful behavior for the typo domain .wmflabs.org" [dns] - https://gerrit.wikimedia.org/r/762075 (https://phabricator.wikimedia.org/T301592) (owner: Andrew Bogott)
[17:41:05] (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[17:52:21] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata for Skye Berghel - https://phabricator.wikimedia.org/T301581 (Htriedman) Hi SRE team! Just a couple of clarifications here — the approving party is actually @JBennett, rather than myself. With regard to the NDA and SAR forms, the confi...
[17:59:16] SRE, User-Ladsgroup, Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505 (Ladsgroup) Open→Resolved I believe that one took only five minutes and Wikipedia has been accessible after that. So I close t...
[18:06:11] (PS1) Andrew Bogott: nfs-mounts.yaml: move puppet-diffs to a project-local nfs server [puppet] - https://gerrit.wikimedia.org/r/762076 (https://phabricator.wikimedia.org/T301280)
[18:09:34] (CR) Andrew Bogott: [C: +2] nfs-mounts.yaml: move puppet-diffs to a project-local nfs server [puppet] - https://gerrit.wikimedia.org/r/762076 (https://phabricator.wikimedia.org/T301280) (owner: Andrew Bogott)
[18:12:31] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[18:12:49] SRE, Phabricator, vm-requests: VM Request template (form 90) title doesn't make sense - https://phabricator.wikimedia.org/T301387 (RhinosF1)
[18:13:20] SRE, Phabricator, vm-requests: VM Request template (form 90) title doesn't make sense - https://phabricator.wikimedia.org/T301387 (RhinosF1) Oops, got the number from the link below on the menu by accident. https://phabricator.wikimedia.org/maniphest/task/edit/form/90/
[18:20:41] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata for Skye Berghel - https://phabricator.wikimedia.org/T301581 (RhinosF1) SRE will be able to check on their tracking sheets or confirm with legal. No need to worry but thanks for being super clear :) @Ottomata normally needs to approve...
[18:20:52] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata for Skye Berghel - https://phabricator.wikimedia.org/T301581 (RhinosF1)
[18:27:01] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[18:51:07] SRE, MassMessage, WMF-JobQueue, Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (Snaevar) >>! In T93049#7381360, @Quiddity wrote: > @Snaevar Thanks for the report. Unfortunately, (per the comment above in T93049...
[18:56:17] (PS1) Andrew Bogott: nfs-mounts.yaml: move dumps to a project-local nfs server [puppet] - https://gerrit.wikimedia.org/r/762079 (https://phabricator.wikimedia.org/T301280)
[19:27:18] (CR) Andrew Bogott: [C: +2] nfs-mounts.yaml: move dumps to a project-local nfs server [puppet] - https://gerrit.wikimedia.org/r/762079 (https://phabricator.wikimedia.org/T301280) (owner: Andrew Bogott)
[19:42:40] (PS1) Andrew Bogott: cinder backups: add some new nfs servers to the backup roster [puppet] - https://gerrit.wikimedia.org/r/762091 (https://phabricator.wikimedia.org/T301280)
[19:44:33] (CR) Andrew Bogott: [C: +2] cinder backups: add some new nfs servers to the backup roster [puppet] - https://gerrit.wikimedia.org/r/762091 (https://phabricator.wikimedia.org/T301280) (owner: Andrew Bogott)
[19:55:48] (PS1) Andrew Bogott: wmcs-cinder-backup-manager: fix some volume names [puppet] - https://gerrit.wikimedia.org/r/762092
[19:59:43] (CR) Andrew Bogott: [C: +2] wmcs-cinder-backup-manager: fix some volume names [puppet] - https://gerrit.wikimedia.org/r/762092 (owner: Andrew Bogott)
[20:47:05] PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[20:58:53] RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms
[21:41:05] (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[22:57:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[22:57:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[22:57:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[22:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:58:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[22:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:58:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T300775)', diff saved to https://phabricator.wikimedia.org/P20617 and previous config saved to /var/cache/conftool/dbconfig/20220212-225806-marostegui.json
[22:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:58:14] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775