[00:47:19] <icinga-wm>	 PROBLEM - Check systemd state on apifeatureusage1001 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_apifeatureusage_codfw.service,curator_actions_apifeatureusage_eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:47:29] <icinga-wm>	 PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:40:30] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[02:08:39] <wikibugs>	 SRE, DNS, Traffic: Central and South American countries in geo-maps - https://phabricator.wikimedia.org/T301605 (RLazarus)
[02:28:32] <wikibugs>	 SRE-Access-Requests: Requesting access to analytics-privatedata-users for rzl - https://phabricator.wikimedia.org/T301606 (RLazarus)
[02:41:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T300775)', diff saved to https://phabricator.wikimedia.org/P20613 and previous config saved to /var/cache/conftool/dbconfig/20220212-024155-marostegui.json
[02:42:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:42:02] <stashbot>	 T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[02:57:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P20614 and previous config saved to /var/cache/conftool/dbconfig/20220212-025700-marostegui.json
[02:57:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:10:12] <icinga-wm>	 PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:12:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P20615 and previous config saved to /var/cache/conftool/dbconfig/20220212-031205-marostegui.json
[03:12:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:27:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T300775)', diff saved to https://phabricator.wikimedia.org/P20616 and previous config saved to /var/cache/conftool/dbconfig/20220212-032710-marostegui.json
[03:27:11] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[03:27:13] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[03:27:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:27:15] <stashbot>	 T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[03:27:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:27:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:11:31] <icinga-wm>	 RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:42:35] <icinga-wm>	 PROBLEM - Disk space on krb1001 is CRITICAL: DISK CRITICAL - free space: / 1051 MB (2% inode=97%): /tmp 1051 MB (2% inode=97%): /var/tmp 1051 MB (2% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=krb1001&var-datasource=eqiad+prometheus/ops
[05:14:25] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 85 probes of 652 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:20:49] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 56 probes of 652 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:22:33] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to gerritadmin for Taavi Väänänen (Majavah) - https://phabricator.wikimedia.org/T301579 (Legoktm) >>! In T301579#7704673, @thcipriani wrote: > I'm in favor of this, but I'd like feedback of the other active gerritadmins as to whether they need more help.  I don't t...
[05:41:05] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[05:53:37] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:23:37] <dcausse>	 !log restarting blazegraph on wdqs1004 (JVM stuck for 4 hours)
[07:23:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220212T0800)
[08:41:03] <wikibugs>	 (CR) Jelto: [C: +2] aptrepo::files::updates Update gitlab-ce and gitlab-runner to 14.7 [puppet] - https://gerrit.wikimedia.org/r/761917 (https://phabricator.wikimedia.org/T300967) (owner: Jelto)
[08:49:51] <elukey>	 !log truncate /var/log/auth.log to 1g on krb1001 to free space on root partition (original log saved under /srv)
[08:49:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:58] <elukey>	 I'll open a task on Monday for this --^
[08:58:09] <icinga-wm>	 RECOVERY - Disk space on krb1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=krb1001&var-datasource=eqiad+prometheus/ops
[09:41:05] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[09:41:18] <jelto>	 !log update gitlab2001 to gitlab-ce 14.7.2-ce.0
[09:41:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:50] <jelto>	 !log update gitlab1001 to gitlab-ce 14.7.2-ce.0
[09:52:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:38] <jelto>	 !log update gitlab-runner1001 and gitlab-runner2001 to gitlab-runner 14.7.0
[10:02:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:07] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:59:14] <wikibugs>	 (CR) BBlack: [C: +1] wikimedia-dns.org: add AAAA records for Wikidough [dns] - https://gerrit.wikimedia.org/r/761363 (https://phabricator.wikimedia.org/T301165) (owner: Ssingh)
[11:21:43] <icinga-wm>	 PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:10:19] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[12:10:21] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[12:10:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:01] <icinga-wm>	 RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:41:05] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[14:53:59] <wikibugs>	 SRE, Discovery, Infrastructure-Foundations, netops: Speed up network connections for Elastic hosts - https://phabricator.wikimedia.org/T301577 (cmooney) Jumbo frames only bring benefits under very specific conditions for particular workloads.  What evaluation of traffic patterns has been done tha...
[15:35:39] <wikibugs>	 (PS4) Daimona Eaytoy: Relax CSP rules for taint-check-demo [puppet] - https://gerrit.wikimedia.org/r/680337 (https://phabricator.wikimedia.org/T257301)
[15:35:43] <wikibugs>	 (PS5) Daimona Eaytoy: Relax CSP rules for taint-check-demo [puppet] - https://gerrit.wikimedia.org/r/680337 (https://phabricator.wikimedia.org/T257301)
[15:37:45] <wikibugs>	 (PS6) Daimona Eaytoy: Relax CSP rules for taint-check-demo [puppet] - https://gerrit.wikimedia.org/r/680337 (https://phabricator.wikimedia.org/T257301)
[16:01:12] <wikibugs>	 (Abandoned) Zabe: Start writing to $wmgConfigDir the same value as to $wmfConfigDir [mediawiki-config] - https://gerrit.wikimedia.org/r/760570 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[16:03:02] <wikibugs>	 (CR) Zabe: [C: +1] wmf-config: Use __DIR__ instead of re-using an unintended global [mediawiki-config] - https://gerrit.wikimedia.org/r/761963 (https://phabricator.wikimedia.org/T45956) (owner: Krinkle)
[16:04:01] <icinga-wm>	 PROBLEM - SSH on wtp1027.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:59:16] <wikibugs>	 SRE, Phabricator, vm-requests: VM Request template (form 84) title doesn't make sense - https://phabricator.wikimedia.org/T301387 (Aklapper) That doesn't seem to be form 84?
[17:03:24] <wikibugs>	 Puppet, Infrastructure-Foundations: update hiera order ii production environment - https://phabricator.wikimedia.org/T301349 (Aklapper)
[17:05:22] <icinga-wm>	 RECOVERY - SSH on wtp1027.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:12:39] <wikibugs>	 SRE, Wikidata, wdwb-tech: Move dispatching of wikidata to a dedicated node - https://phabricator.wikimedia.org/T193733 (Bugreporter) is this task still valid?
[17:13:17] <wikibugs>	 (PS1) Andrew Bogott: novaproxy: add redirects for wmfcloud.org and www.wmfcloud.org [puppet] - https://gerrit.wikimedia.org/r/762074 (https://phabricator.wikimedia.org/T301592)
[17:16:15] <wikibugs>	 SRE, Wikidata, wdwb-tech: Move dispatching of wikidata to a dedicated node - https://phabricator.wikimedia.org/T193733 (Ladsgroup) Stalled→Declined No.
[17:23:50] <wikibugs>	 (PS1) Andrew Bogott: Add parking domain for wmfcloud.org [dns] - https://gerrit.wikimedia.org/r/762075 (https://phabricator.wikimedia.org/T301592)
[17:24:53] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add parking domain for wmfcloud.org [dns] - https://gerrit.wikimedia.org/r/762075 (https://phabricator.wikimedia.org/T301592) (owner: Andrew Bogott)
[17:26:20] <wikibugs>	 (PS2) Andrew Bogott: Add parking domain for wmfcloud.org [dns] - https://gerrit.wikimedia.org/r/762075 (https://phabricator.wikimedia.org/T301592)
[17:27:11] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add parking domain for wmfcloud.org [dns] - https://gerrit.wikimedia.org/r/762075 (https://phabricator.wikimedia.org/T301592) (owner: Andrew Bogott)
[17:29:23] <wikibugs>	 (CR) Andrew Bogott: "@Valentin -- I definitely don't know what I'm doing here, but I'm hoping to add slightly-helpful behavior for the typo domain .wmflabs.org" [dns] - https://gerrit.wikimedia.org/r/762075 (https://phabricator.wikimedia.org/T301592) (owner: Andrew Bogott)
[17:41:05] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[17:52:21] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata for Skye Berghel - https://phabricator.wikimedia.org/T301581 (Htriedman) Hi SRE team! Just a couple of clarifications here — the approving party is actually @JBennett, rather than myself.  With regard to the NDA and SAR forms, the confi...
[17:59:16] <wikibugs>	 SRE, User-Ladsgroup, Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505 (Ladsgroup) Open→Resolved I believe that one took only five minutes and Wikipedia has been accessible after that. So I close t...
[18:06:11] <wikibugs>	 (PS1) Andrew Bogott: nfs-mounts.yaml: move puppet-diffs to a project-local nfs server [puppet] - https://gerrit.wikimedia.org/r/762076 (https://phabricator.wikimedia.org/T301280)
[18:09:34] <wikibugs>	 (CR) Andrew Bogott: [C: +2] nfs-mounts.yaml: move puppet-diffs to a project-local nfs server [puppet] - https://gerrit.wikimedia.org/r/762076 (https://phabricator.wikimedia.org/T301280) (owner: Andrew Bogott)
[18:12:31] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[18:12:49] <wikibugs>	 SRE, Phabricator, vm-requests: VM Request template (form 90) title doesn't make sense - https://phabricator.wikimedia.org/T301387 (RhinosF1)
[18:13:20] <wikibugs>	 SRE, Phabricator, vm-requests: VM Request template (form 90) title doesn't make sense - https://phabricator.wikimedia.org/T301387 (RhinosF1) Oops, got the number from the link below on the menu by accident.  https://phabricator.wikimedia.org/maniphest/task/edit/form/90/
[18:20:41] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata for Skye Berghel - https://phabricator.wikimedia.org/T301581 (RhinosF1) SRE will be able to check on their tracking sheets or confirm with legal. No need to worry but thanks for being super clear :)  @Ottomata normally needs to approve...
[18:20:52] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata for Skye Berghel - https://phabricator.wikimedia.org/T301581 (RhinosF1)
[18:27:01] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[18:51:07] <wikibugs>	 SRE, MassMessage, WMF-JobQueue, Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (Snaevar) >>! In T93049#7381360, @Quiddity wrote: > @Snaevar Thanks for the report. Unfortunately, (per the comment above in T93049...
[18:56:17] <wikibugs>	 (PS1) Andrew Bogott: nfs-mounts.yaml: move dumps to a project-local nfs server [puppet] - https://gerrit.wikimedia.org/r/762079 (https://phabricator.wikimedia.org/T301280)
[19:27:18] <wikibugs>	 (CR) Andrew Bogott: [C: +2] nfs-mounts.yaml: move dumps to a project-local nfs server [puppet] - https://gerrit.wikimedia.org/r/762079 (https://phabricator.wikimedia.org/T301280) (owner: Andrew Bogott)
[19:42:40] <wikibugs>	 (PS1) Andrew Bogott: cinder backups: add some new nfs servers to the backup roster [puppet] - https://gerrit.wikimedia.org/r/762091 (https://phabricator.wikimedia.org/T301280)
[19:44:33] <wikibugs>	 (CR) Andrew Bogott: [C: +2] cinder backups: add some new nfs servers to the backup roster [puppet] - https://gerrit.wikimedia.org/r/762091 (https://phabricator.wikimedia.org/T301280) (owner: Andrew Bogott)
[19:55:48] <wikibugs>	 (PS1) Andrew Bogott: wmcs-cinder-backup-manager: fix some volume names [puppet] - https://gerrit.wikimedia.org/r/762092
[19:59:43] <wikibugs>	 (CR) Andrew Bogott: [C: +2] wmcs-cinder-backup-manager: fix some volume names [puppet] - https://gerrit.wikimedia.org/r/762092 (owner: Andrew Bogott)
[20:47:05] <icinga-wm>	 PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[20:58:53] <icinga-wm>	 RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms
[21:41:05] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[22:57:55] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[22:57:57] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[22:57:58] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[22:58:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:58:02] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[22:58:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:58:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:58:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T300775)', diff saved to https://phabricator.wikimedia.org/P20617 and previous config saved to /var/cache/conftool/dbconfig/20220212-225806-marostegui.json
[22:58:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:58:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:58:14] <stashbot>	 T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775