[00:12:07] PROBLEM - dump of es4 in eqiad on backupmon1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than a week ago: Most recent backup 2022-07-19 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:15:35] PROBLEM - dump of es5 in codfw on backupmon1001 is CRITICAL: dump for es5 at codfw (es2025) taken more than a week ago: Most recent backup 2022-07-19 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:18:53] SRE, Phabricator, Sustainability (Incident Followup): Unable to view tasks in read-only mode - https://phabricator.wikimedia.org/T313879 (RLazarus)
[00:22:33] SRE, SRE-OnFire, Infrastructure-Foundations, netops, and 2 others: asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (RLazarus)
[00:23:06] SRE, ops-eqiad, Infrastructure-Foundations, netops, Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (RLazarus)
[00:24:18] (PS3) BCornwall: geodns: Map out African countries by DC latency [dns] - https://gerrit.wikimedia.org/r/816028 (https://phabricator.wikimedia.org/T311472)
[00:24:20] (PS3) BCornwall: geodns: Move eqsin, drmrs and esams around in Asia [dns] - https://gerrit.wikimedia.org/r/816053 (https://phabricator.wikimedia.org/T311472)
[00:27:31] (CR) BCornwall: geodns: Map out African countries by DC latency (2 comments) [dns] - https://gerrit.wikimedia.org/r/816028 (https://phabricator.wikimedia.org/T311472) (owner: BCornwall)
[00:30:05] PROBLEM - dump of es5 in eqiad on backupmon1001 is CRITICAL: dump for es5 at eqiad (es1025) taken more than a week ago: Most recent backup 2022-07-19 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:33:21] PROBLEM - dump of es4 in codfw on backupmon1001 is CRITICAL: dump for es4 at codfw (es2022) taken more than a week ago: Most recent backup 2022-07-19 00:00:01
https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:39:15] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:46:06] SRE, SRE-OnFire, Shellbox, serviceops, Sustainability (Incident Followup): Shellbox resource management - https://phabricator.wikimedia.org/T310557 (RLazarus) Open→Resolved a:RLazarus
[00:57:51] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:01:37] RECOVERY - dump of es5 in eqiad on backupmon1001 is OK: Last dump for es5 at eqiad (es1025) taken on 2022-07-26 00:00:01 (3286 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:04:53] RECOVERY - dump of es4 in codfw on backupmon1001 is OK: Last dump for es4 at codfw (es2022) taken on 2022-07-26 00:00:02 (3307 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:13:41] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:24:55] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[01:37:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets -
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:57] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:42:45] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:52:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:17:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:18:03] RECOVERY - dump of es4 in eqiad on backupmon1001 is OK: Last dump for es4 at eqiad (es1022) taken on 2022-07-26 00:00:01 (3307 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[02:22:41] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:22:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:57:13] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough,
WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:00:49] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:12:34] SRE, MediaWiki-General, MediaWiki-libs-Metrics, observability, and 4 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (Krinkle) >>! In T240685#8088120, @colewhite wrote: > From @tstarling's comment, I see a few action items: > # Create a service that can be a drop...
[03:18:09] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:55:52] SRE, SRE-Access-Requests, User-Raymond_Ndibe: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (Raymond_Ndibe)
[03:55:59] RECOVERY - dump of es5 in codfw on backupmon1001 is OK: Last dump for es5 at codfw (es2025) taken on 2022-07-26 00:00:02 (3286 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[04:42:27] (PS1) Krinkle: mediawiki: Extend /portals max-age from 24h to 1 year [puppet] - https://gerrit.wikimedia.org/r/817409 (https://phabricator.wikimedia.org/T313881)
[04:51:35] SRE, ops-eqiad, Infrastructure-Foundations, netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul)
[05:10:29] (PS1) Marostegui: mariadb: Decommission db2086 [puppet] - https://gerrit.wikimedia.org/r/817616 (https://phabricator.wikimedia.org/T313482)
[05:10:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2086.codfw.wmnet
[05:15:18] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox
[05:19:13] !log marostegui@cumin1001 END (PASS)
- Cookbook sre.dns.netbox (exit_code=0)
[05:19:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2086.codfw.wmnet
[05:19:27] ops-codfw, decommission-hardware, Patch-For-Review: decommission db2086 - https://phabricator.wikimedia.org/T313482 (Marostegui) a:Papaul
[05:19:36] ops-codfw, decommission-hardware, Patch-For-Review: decommission db2086 - https://phabricator.wikimedia.org/T313482 (Marostegui) Papaul all yours
[05:27:17] (CR) Marostegui: [C: +2] mariadb: Decommission db2086 [puppet] - https://gerrit.wikimedia.org/r/817616 (https://phabricator.wikimedia.org/T313482) (owner: Marostegui)
[05:37:17] (CR) Andrea Denisse: [C: +1] "Looks good to me! Thank you!" [puppet] - https://gerrit.wikimedia.org/r/816135 (https://phabricator.wikimedia.org/T312947) (owner: Filippo Giunchedi)
[05:58:32] (CR) Giuseppe Lavagetto: [C: +1] Increase $wgObjectCacheSessionExpiry to 86400 [mediawiki-config] - https://gerrit.wikimedia.org/r/816060 (https://phabricator.wikimedia.org/T313496) (owner: Tim Starling)
[06:32:33] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:36:38] (CR) Alexandros Kosiaris: [C: +2] conf100[456]: Remove them from client DNS RRs [dns] - https://gerrit.wikimedia.org/r/817260 (https://phabricator.wikimedia.org/T311408) (owner: Alexandros Kosiaris)
[06:36:52] (CR) Alexandros Kosiaris: [C: +2] "This should move mediawiki to using the newer hosts exclusively" [dns] - https://gerrit.wikimedia.org/r/817260 (https://phabricator.wikimedia.org/T311408) (owner: Alexandros Kosiaris)
[06:39:20] (CR) Alexandros Kosiaris: [C: +1] "This will switch over pybal to use the newer hosts.
I 'll set a cumin based slow restart of pybal in a screen on cumin1001, with say an in" [puppet] - https://gerrit.wikimedia.org/r/817264 (https://phabricator.wikimedia.org/T311408) (owner: Alexandros Kosiaris)
[06:49:43] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:53:03] (PS2) Alexandros Kosiaris: Switch zookeeper clients to use conf100[789] [puppet] - https://gerrit.wikimedia.org/r/817265 (https://phabricator.wikimedia.org/T311408)
[06:53:10] (CR) Alexandros Kosiaris: [V: +2 C: +2] Switch zookeeper clients to use conf100[789] [puppet] - https://gerrit.wikimedia.org/r/817265 (https://phabricator.wikimedia.org/T311408) (owner: Alexandros Kosiaris)
[06:58:05] (PS1) Giuseppe Lavagetto: parsoid::testing: install php 7.4 [puppet] - https://gerrit.wikimedia.org/r/817699 (https://phabricator.wikimedia.org/T312638)
[06:58:07] (PS1) Giuseppe Lavagetto: mediawiki::webserver: fallback php version for webrequests [puppet] - https://gerrit.wikimedia.org/r/817700 (https://phabricator.wikimedia.org/T271736)
[06:58:09] (PS1) Giuseppe Lavagetto: parsoid::testing: switch to php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/817701 (https://phabricator.wikimedia.org/T312638)
[06:59:23] (CR) CI reject: [V: -1] mediawiki::webserver: fallback php version for webrequests [puppet] - https://gerrit.wikimedia.org/r/817700 (https://phabricator.wikimedia.org/T271736) (owner: Giuseppe Lavagetto)
[07:00:04] SRE-OnFire, observability, SRE Observability (FY2022/2023-Q1): Business hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers - https://phabricator.wikimedia.org/T313603 (SLyngshede-WMF) A slightly weird way of handling the issue automatically could be using Se...
[07:00:05] Amir1 and Urbanecm: Your horoscope predicts another unfortunate UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220727T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:00:35] PROBLEM - Check systemd state on kafkamon1002 is CRITICAL: CRITICAL - degraded: The following units failed: burrow-jumbo-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:01:13] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36422/console" [puppet] - https://gerrit.wikimedia.org/r/817699 (https://phabricator.wikimedia.org/T312638) (owner: Giuseppe Lavagetto)
[07:01:15] (PS1) Marostegui: mariadb: Promote db2161 to s8 codfw master [puppet] - https://gerrit.wikimedia.org/r/817702 (https://phabricator.wikimedia.org/T313798)
[07:03:17] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 14 hosts with reason: codfw s8 master switch
[07:03:28] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 14 hosts with reason: codfw s8 master switch
[07:05:08] !log Restart db2161 to change its binlog format
[07:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:45] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:07:11] (PS1) Alexandros Kosiaris: etcd-backup: Switch shebang to python3 [puppet] - https://gerrit.wikimedia.org/r/817704
[07:07:21] (CR) Giuseppe Lavagetto: [V: +1 C: +2] parsoid::testing: install php 7.4 [puppet] -
https://gerrit.wikimedia.org/r/817699 (https://phabricator.wikimedia.org/T312638) (owner: Giuseppe Lavagetto)
[07:08:47] PROBLEM - Check systemd state on ms-be2065 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:09:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2161 with weight 0 T313798', diff saved to https://phabricator.wikimedia.org/P31954 and previous config saved to /var/cache/conftool/dbconfig/20220727-070901-marostegui.json
[07:09:06] T313798: Switchover s8 codfw master - https://phabricator.wikimedia.org/T313798
[07:10:37] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2065 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:10:41] (CR) Alexandros Kosiaris: [C: +2] etcd-backup: Switch shebang to python3 [puppet] - https://gerrit.wikimedia.org/r/817704 (owner: Alexandros Kosiaris)
[07:12:39] (CR) Marostegui: [C: +2] mariadb: Promote db2161 to s8 codfw master [puppet] - https://gerrit.wikimedia.org/r/817702 (https://phabricator.wikimedia.org/T313798) (owner: Marostegui)
[07:16:15] SRE, ops-eqiad, DC-Ops: Replace RAID controller battery in an-worker1082 - https://phabricator.wikimedia.org/T312626 (Volans)
[07:18:01] (PS3) Filippo Giunchedi: prometheus: update blackbox check alerts runbook link [puppet] - https://gerrit.wikimedia.org/r/816135 (https://phabricator.wikimedia.org/T312947)
[07:18:11] !log restarted ferm on ms-be2065 (had failed for a timed out query)
[07:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:31] RECOVERY - Check systemd state on ms-be2065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:19:01] SRE, ops-eqiad, DC-Ops: Replace
RAID controller battery in an-worker1082 - https://phabricator.wikimedia.org/T312626 (Volans) FYI this is still randomly alerting on IRC: ` icinga-wm| PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBac...
[07:21:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:27:13] PROBLEM - Check systemd state on ms-be1065 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:27:23] RECOVERY - Check systemd state on conf1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:28:03] RECOVERY - Check systemd state on conf1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:28:13] RECOVERY - Check systemd state on conf1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:28:47] RECOVERY - Check unit status of etcd-backup on conf1008 is OK: OK: Status of the systemd unit etcd-backup https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:29:03] RECOVERY - Check unit status of etcd-backup on conf1007 is OK: OK: Status of the systemd unit etcd-backup https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:30:23] !log restarted ferm on ms-be1065 (had failed for a timed out query)
[07:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2161 to s8 codfw primary T313798', diff saved to https://phabricator.wikimedia.org/P31955 and previous config saved to
/var/cache/conftool/dbconfig/20220727-073214-marostegui.json
[07:32:19] T313798: Switchover s8 codfw master - https://phabricator.wikimedia.org/T313798
[07:32:59] RECOVERY - Check unit status of etcd-backup on conf1009 is OK: OK: Status of the systemd unit etcd-backup https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:33:29] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[07:34:27] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[07:34:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2079 T313798', diff saved to https://phabricator.wikimedia.org/P31956 and previous config saved to /var/cache/conftool/dbconfig/20220727-073442-marostegui.json
[07:34:46] SRE, SRE-Access-Requests, User-Raymond_Ndibe: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (Volans) p:Triage→Medium
[07:37:18] (PS1) Marostegui: db2165: Change binlog format [puppet] - https://gerrit.wikimedia.org/r/817706
[07:38:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[07:38:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[07:38:46] (CR) Marostegui: [C: +2] db2165: Change binlog format [puppet] - https://gerrit.wikimedia.org/r/817706 (owner: Marostegui)
[07:38:54] SRE, SRE-Access-Requests: Requesting access to phab1001/phab2001 for Daniel Duvall (dduvall) - https://phabricator.wikimedia.org/T313831 (Volans) p:Triage→Medium
[07:40:55] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2065 is OK: OK ferm input default policy is set
https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:41:09] SRE, SRE-Access-Requests: Requesting access to phab1001/phab2001 for Daniel Duvall (dduvall) - https://phabricator.wikimedia.org/T313831 (Volans) @thcipriani: your approval both as group approver and manager is required here ;) I'm preparing the patch in the meanwhile.
[07:41:29] (PS1) Volans: admin: add dduvall to phabricator-roots [puppet] - https://gerrit.wikimedia.org/r/817707 (https://phabricator.wikimedia.org/T313831)
[07:41:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[07:41:48] (CR) Volans: [C: -1] "Waiting approval on task." [puppet] - https://gerrit.wikimedia.org/r/817707 (https://phabricator.wikimedia.org/T313831) (owner: Volans)
[07:41:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[07:42:26] (PS1) Marostegui: db2079: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/817708 (https://phabricator.wikimedia.org/T313885)
[07:42:29] RECOVERY - Check systemd state on ms-be1065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:43:19] (CR) Marostegui: [C: +2] db2079: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/817708 (https://phabricator.wikimedia.org/T313885) (owner: Marostegui)
[07:45:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[07:45:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[07:45:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T312990)', diff saved to https://phabricator.wikimedia.org/P31957 and previous config saved to
/var/cache/conftool/dbconfig/20220727-074546-marostegui.json
[07:45:50] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[07:46:34] (PS1) Marostegui: mariadb: Productionize db2172 [puppet] - https://gerrit.wikimedia.org/r/817709 (https://phabricator.wikimedia.org/T311493)
[07:48:35] (PS2) Marostegui: mariadb: Productionize db2172 [puppet] - https://gerrit.wikimedia.org/r/817709 (https://phabricator.wikimedia.org/T311493)
[07:50:05] (CR) Marostegui: [C: +2] mariadb: Productionize db2172 [puppet] - https://gerrit.wikimedia.org/r/817709 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[07:50:58] (PS2) KartikMistry: Update cxserver to 2022-07-27-070728-production [deployment-charts] - https://gerrit.wikimedia.org/r/817089 (https://phabricator.wikimedia.org/T313300)
[07:51:29] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:54:13] (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36423/console" [puppet] - https://gerrit.wikimedia.org/r/815769 (https://phabricator.wikimedia.org/T311746) (owner: Dduvall)
[07:55:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T312990)', diff saved to https://phabricator.wikimedia.org/P31958 and previous config saved to /var/cache/conftool/dbconfig/20220727-075523-marostegui.json
[07:55:28] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[07:55:59] (PS1) Marostegui: db2170: Add it to dbctl [puppet] - https://gerrit.wikimedia.org/r/817710 (https://phabricator.wikimedia.org/T311493)
[07:56:23] (CR) Vgutierrez: [C: +1] "LGTM, Filippo already gave you a +1 so I'm assuming that the intermediate
step of setting ensure to absent isn't required" [puppet] - https://gerrit.wikimedia.org/r/814894 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall)
[07:57:20] (CR) Marostegui: [C: +2] db2170: Add it to dbctl [puppet] - https://gerrit.wikimedia.org/r/817710 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[07:59:35] SRE, SRE-Access-Requests, User-Raymond_Ndibe: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (Volans)
[08:00:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2170 (s1, s2) to dbctl T311493', diff saved to https://phabricator.wikimedia.org/P31959 and previous config saved to /var/cache/conftool/dbconfig/20220727-080029-marostegui.json
[08:00:34] T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493
[08:00:35] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:01:51] SRE, SRE-Access-Requests, User-Raymond_Ndibe: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (Volans) @Raymond_Ndibe the group `cloud-roots` you're requesting access to does not exists. Did you meant `wmcs-roots`? Please check the groups defined in http...
[08:04:59] SRE, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for EllenR - https://phabricator.wikimedia.org/T313821 (Volans) @ERayfield some clarification is needed: > Do you currently have shell access (Yes/No)? > Yes With which user do you have shell access? I was able to find the `erayfiel...
[08:09:15] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:09:51] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:10:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P31960 and previous config saved to /var/cache/conftool/dbconfig/20220727-081029-marostegui.json
[08:10:33] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48390 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:11:13] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.286 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:15:32] (PS1) Marostegui: db2171: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/817711 (https://phabricator.wikimedia.org/T311493)
[08:16:24] (CR) Marostegui: [C: +2] db2171: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/817711 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[08:16:37] (PS1) Volans: Add configurationf for the release script [software/pywmflib] - https://gerrit.wikimedia.org/r/817712
[08:18:48] (PS2) Volans: Add configuration for the release script [software/pywmflib] - https://gerrit.wikimedia.org/r/817712
[08:23:23] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:25:02] (PS1) Marostegui: instances.yaml: Add db2171 to dbctl [puppet] - https://gerrit.wikimedia.org/r/817715 (https://phabricator.wikimedia.org/T311493)
[08:25:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to
https://phabricator.wikimedia.org/P31961 and previous config saved to /var/cache/conftool/dbconfig/20220727-082535-marostegui.json
[08:26:13] (CR) Marostegui: [C: +2] instances.yaml: Add db2171 to dbctl [puppet] - https://gerrit.wikimedia.org/r/817715 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[08:26:46] (PS1) Volans: Add configuration for the release script [software/spicerack] - https://gerrit.wikimedia.org/r/817717
[08:28:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2171 (s5, s6) to dbctl T311493', diff saved to https://phabricator.wikimedia.org/P31962 and previous config saved to /var/cache/conftool/dbconfig/20220727-082817-marostegui.json
[08:28:22] T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493
[08:30:47] SRE, SRE-Access-Requests: Requesting access to maintenance servers for mfossati - https://phabricator.wikimedia.org/T313706 (mfossati) @Volans : I confirm I can `ssh mwmaint1002.eqiad.wmnet`. @thcipriani : I attended a deployment training session, see {T302204}. I've also scheduled another one: {T313812}...
[08:31:53] (PS1) Volans: Add configuration for the release script [software/homer] - https://gerrit.wikimedia.org/r/817719
[08:33:13] SRE, LDAP-Access-Requests: Grant Access to LDAP wmf group for Aline Bruenger WMDE - https://phabricator.wikimedia.org/T312220 (Aline_Bruenger_WMDE) Open→Resolved a:Aline_Bruenger_WMDE Thank you very much, @Joe, @Volans and @Dzahn! Apologies for confusing the LDAP groups and for my late resp...
[08:33:53] (CR) FNegri: wmcs-cinder-backup: fix Retrying() call (1 comment) [puppet] - https://gerrit.wikimedia.org/r/816841 (owner: Andrew Bogott)
[08:40:10] SRE, SRE-Access-Requests: Requesting access to maintenance servers for mfossati - https://phabricator.wikimedia.org/T313706 (Volans) @mfossati great!
I think we could close this task then and when the time comes open a separate one for `deployment`. Please mention in the future request to convert `restri...
[08:40:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T312990)', diff saved to https://phabricator.wikimedia.org/P31964 and previous config saved to /var/cache/conftool/dbconfig/20220727-084042-marostegui.json
[08:40:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[08:40:47] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[08:40:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[08:40:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:41:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:41:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T312990)', diff saved to https://phabricator.wikimedia.org/P31965 and previous config saved to /var/cache/conftool/dbconfig/20220727-084120-marostegui.json
[08:42:14] SRE, SRE-Access-Requests: Requesting access to maintenance servers for mfossati - https://phabricator.wikimedia.org/T313706 (mfossati) Open→Resolved a:mfossati That sounds good, closing!
[08:43:15] (PS1) Marostegui: site.pp: Remove insetup role from db2172 [puppet] - https://gerrit.wikimedia.org/r/817721 (https://phabricator.wikimedia.org/T311493)
[08:44:52] (CR) Marostegui: [C: +2] site.pp: Remove insetup role from db2172 [puppet] - https://gerrit.wikimedia.org/r/817721 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[08:46:44] (CR) Jelto: [V: +1 C: +2] "lgtm, thanks a lot! I'm going to test and deploy this on all Runners. Hopefully with one last de-registering and re-registering ;)" [puppet] - https://gerrit.wikimedia.org/r/815769 (https://phabricator.wikimedia.org/T311746) (owner: Dduvall)
[08:47:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T312990)', diff saved to https://phabricator.wikimedia.org/P31966 and previous config saved to /var/cache/conftool/dbconfig/20220727-084715-marostegui.json
[08:47:20] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[08:47:59] jouncebot: nowandnext
[08:47:59] No deployments scheduled for the next 4 hour(s) and 12 minute(s)
[08:47:59] In 4 hour(s) and 12 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220727T1300)
[08:48:03] awesome
[08:48:48] (PS1) Volans: Add configuration for the release script [software/debmonitor] - https://gerrit.wikimedia.org/r/817722
[08:48:55] (PS1) Ladsgroup: Bump portals to HEAD [mediawiki-config] - https://gerrit.wikimedia.org/r/817723
[08:49:57] akosiaris: o/
[08:50:13] burrow on kafkamon1002 seems broken after the conf hostname changes in hiera
[08:52:17] elukey: o/
[08:52:20] having a look
[08:53:05] elukey: unsurprising. I restarted the wrong unit.
fixed [08:55:05] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:55:19] akosiaris: it has been a while since I looked into the hosts, so the "burrow" unit should be masked IIRC.. the other more specific units are still failing for a missing pid file though [08:56:07] probably /var/run/burrow is missing, puppet doesn't create it [08:56:57] yeah [08:57:25] !log manually create /var/run/burrow on kafkamon1002 to allow a clean restart of Burrow daemons (after zookeeper config change) [08:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:51] (03PS1) 10Filippo Giunchedi: hieradata: set all new swift hosts with 24 HDD [puppet] - 10https://gerrit.wikimedia.org/r/817724 (https://phabricator.wikimedia.org/T294549) [08:57:52] !log restart burrow-* on kafkamon1002 to pick up zookeeper changes [08:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:37] (03CR) 10Ladsgroup: [C: 03+2] Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817723 (owner: 10Ladsgroup) [08:58:51] RECOVERY - Check systemd state on kafkamon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:59:27] (03Merged) 10jenkins-bot: Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817723 (owner: 10Ladsgroup) [09:00:26] (03CR) 10Filippo Giunchedi: "See inline, need to add hosts to hieradata/common/profile/swift.yaml too" [puppet] - 10https://gerrit.wikimedia.org/r/817209 (https://phabricator.wikimedia.org/T294549) (owner: 10MVernon) [09:00:41] (03PS1) 10Volans: Add configuration for the release script 
[software/cumin] - 10https://gerrit.wikimedia.org/r/817726 [09:01:27] !log reboot ml-serve2001 - T313822 [09:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:37] T313822: codfw: ml-serve2001 memmory issue DIMM A2 - https://phabricator.wikimedia.org/T313822 [09:02:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P31967 and previous config saved to /var/cache/conftool/dbconfig/20220727-090221-marostegui.json [09:02:31] (03CR) 10MVernon: [C: 03+1] "Thanks for this, looks like what we want :)" [puppet] - 10https://gerrit.wikimedia.org/r/817724 (https://phabricator.wikimedia.org/T294549) (owner: 10Filippo Giunchedi) [09:02:43] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2001.codfw.wmnet [09:02:58] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: set all new swift hosts with 24 HDD [puppet] - 10https://gerrit.wikimedia.org/r/817724 (https://phabricator.wikimedia.org/T294549) (owner: 10Filippo Giunchedi) [09:03:18] (03PS2) 10Volans: Add configuration for the release script [software/cumin] - 10https://gerrit.wikimedia.org/r/817726 [09:05:43] !log ladsgroup@deploy1002 Synchronized portals/wikipedia.org/assets: Fixing favicon of wikiquote and wikibooks, take II (duration: 03m 49s) [09:06:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:08:36] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: update blackbox check alerts runbook link [puppet] - 10https://gerrit.wikimedia.org/r/816135 (https://phabricator.wikimedia.org/T312947) (owner: 10Filippo Giunchedi) [09:09:08] !log ladsgroup@deploy1002 Synchronized portals: Fixing favicon of wikiquote and wikibooks, take II (duration: 03m 24s) [09:11:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:11:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START 
helmfile.d/services/mwdebug: apply [09:11:23] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2001.codfw.wmnet [09:11:36] 10SRE, 10ops-codfw, 10Machine-Learning-Team: codfw: ml-serve2001 memmory issue DIMM A2 - https://phabricator.wikimedia.org/T313822 (10elukey) @Papaul host rebooted! It is not running any K8s pods at the moment so if any maintenance is needed, feel free to downtime and go ahead :) For the ML-Team - the node... [09:11:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:13:59] (03PS1) 10Marostegui: instances.yaml: Remove db2087 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/817728 (https://phabricator.wikimedia.org/T313483) [09:15:06] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db2087 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/817728 (https://phabricator.wikimedia.org/T313483) (owner: 10Marostegui) [09:17:17] (03PS1) 10Marostegui: mariadb: Productionize db2173 [puppet] - 10https://gerrit.wikimedia.org/r/817729 (https://phabricator.wikimedia.org/T311493) [09:18:13] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2173 [puppet] - 10https://gerrit.wikimedia.org/r/817729 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [09:21:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2087.codfw.wmnet [09:22:22] (03PS1) 10Marostegui: mariadb: Decommission db2087 [puppet] - 10https://gerrit.wikimedia.org/r/817731 (https://phabricator.wikimedia.org/T313483) [09:24:58] (03PS1) 10Elukey: ml-services: move prod docker images to KServe 0.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/817732 (https://phabricator.wikimedia.org/T311982) [09:25:36] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db2087 [puppet] - 10https://gerrit.wikimedia.org/r/817731 (https://phabricator.wikimedia.org/T313483) (owner: 10Marostegui) [09:25:39] !log marostegui@cumin1001 START - Cookbook 
sre.dns.netbox [09:26:54] 10SRE, 10Phabricator, 10Sustainability (Incident Followup): Unable to view tasks in DB read-only mode - https://phabricator.wikimedia.org/T313879 (10Aklapper) [09:28:33] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [09:28:45] ^ fixing [09:29:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2087 from dbctl T313483', diff saved to https://phabricator.wikimedia.org/P31968 and previous config saved to /var/cache/conftool/dbconfig/20220727-092917-marostegui.json [09:29:22] T313483: decommission db2087 - https://phabricator.wikimedia.org/T313483 [09:29:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P31969 and previous config saved to /var/cache/conftool/dbconfig/20220727-092924-marostegui.json [09:29:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:29:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2087.codfw.wmnet [09:30:21] 10ops-codfw, 10decommission-hardware: decommission db2087 - https://phabricator.wikimedia.org/T313483 (10Marostegui) a:03Papaul [09:30:28] 10ops-codfw, 10decommission-hardware: decommission db2087 - https://phabricator.wikimedia.org/T313483 (10Marostegui) @Papaul this is ready [09:31:59] !log ladsgroup@deploy1002 Synchronized portals/wikipedia.org/assets: Fixing favicon of wikiquote and wikibooks, take III (duration: 03m 19s) [09:32:19] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on ml-serve2001.codfw.wmnet with reason: memtest86+ run [09:32:33] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on ml-serve2001.codfw.wmnet with reason: memtest86+ run 
[09:32:39] 10SRE, 10ops-codfw, 10Machine-Learning-Team: codfw: ml-serve2001 memmory issue DIMM A2 - https://phabricator.wikimedia.org/T313822 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b087dff3-f32b-4842-9f10-401f09f59c0c) set by klausman@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their ser... [09:33:34] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [09:35:36] !log ladsgroup@deploy1002 Synchronized portals: Fixing favicon of wikiquote and wikibooks, take III (duration: 03m 36s) [09:36:40] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:37:16] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:39:58] (KubernetesCalicoDown) firing: ml-serve2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:43:06] klausman: I don't recall if the downtime cookbook also downtimes alerts.wikimedia.org too, from --^ it seems that we may need to add more downtime [09:43:18] Will do [09:44:07] elukey: it does for anything with the instance in alertmanager matching what you downtimed [09:44:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T312990)', diff saved to https://phabricator.wikimedia.org/P31970 and previous config saved to /var/cache/conftool/dbconfig/20220727-094430-marostegui.json [09:44:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [09:44:36] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to 
unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [09:44:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [09:44:52] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:44:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T312990)', diff saved to https://phabricator.wikimedia.org/P31971 and previous config saved to /var/cache/conftool/dbconfig/20220727-094452-marostegui.json [09:44:58] volans: I added a 7d downtime for ml-serve2001.codfw.wmnet, which apparently didn't cover the above alert [09:46:05] it matches instance=~"^(ml\-serve2001)(:[0-9]+)?$" [09:46:08] on alertmanager [09:46:35] cc godog for more expertise :) [09:46:48] That regex doesn't include the domain [09:46:56] silence ID is b087dff3-f32b-4842-9f10-401f09f59c0c [09:47:13] klausman: yes, because the instance should not have it [09:47:15] (03PS1) 10Marostegui: site.pp: Promote db2144 to x2 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/817736 (https://phabricator.wikimedia.org/T313811) [09:47:40] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/817736 (https://phabricator.wikimedia.org/T313811) (owner: 10Marostegui) [09:47:45] volans: except it does :) [09:48:11] * volans checking for the related task [09:48:18] ah yeah, I remember we ran into this issue before [09:48:40] T304481 cc volans [09:48:41] T304481: kubernetes / calico alerts have instance with fqdn not hostname - https://phabricator.wikimedia.org/T304481 [09:48:48] yeah just found it [09:49:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [09:49:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [09:49:39] Can silences be edited? [09:50:36] klausman: certainly, you can do it via web from alerts.w.o top right 'bell' icon [09:50:49] then 'browse', and 'edit' on your silence's entry [09:51:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T312990)', diff saved to https://phabricator.wikimedia.org/P31972 and previous config saved to /var/cache/conftool/dbconfig/20220727-095101-marostegui.json [09:51:08] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [09:51:53] unfortunately the bandwidth/priority of the task on my end is unchanged at this time :( [09:52:35] but yeah re-reading the task I think we should aim at having hostnames and not fqdn, as suggested [09:52:52] Ok, edited RE for now [09:55:31] (03CR) 10Ladsgroup: [C: 03+1] site.pp: Promote db2144 to x2 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/817736 (https://phabricator.wikimedia.org/T313811) (owner: 10Marostegui) [09:55:40] (03PS2) 10Giuseppe Lavagetto: mediawiki::webserver: fallback php version for webrequests [puppet] - 10https://gerrit.wikimedia.org/r/817700 (https://phabricator.wikimedia.org/T271736) [09:55:42] (03PS2) 10Giuseppe Lavagetto: parsoid::testing: switch to php 7.4 by default [puppet] - 10https://gerrit.wikimedia.org/r/817701 (https://phabricator.wikimedia.org/T312638) [09:56:38] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:57:38] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36425/console" [puppet] - 10https://gerrit.wikimedia.org/r/817701 (https://phabricator.wikimedia.org/T312638) (owner: 10Giuseppe Lavagetto) [10:01:42] 10SRE, 10ops-codfw, 
10Machine-Learning-Team: codfw: ml-serve2001 memmory issue DIMM A2 - https://phabricator.wikimedia.org/T313822 (10klausman) Ok, the machine is booted and sitting in GRUB. @Papaul I can't seem to run memtes86+ via idrac (I just get a black screen). Can you check whether it works with direc... [10:02:55] (03CR) 10Lucas Werkmeister (WMDE): [WIP] Tune wikidata language selector autocomplete (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817317 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse) [10:03:19] (03CR) 10Klausman: [C: 03+2] ml-services: Add eu, hu & hy wiki articletopic isvcs to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/817269 (https://phabricator.wikimedia.org/T313307) (owner: 10Kevin Bazira) [10:04:33] (03PS1) 10Filippo Giunchedi: customscripts: export 'mgmt' entries from hiera_export [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) [10:05:36] (03CR) 10Jcrespo: "I've realized that prometheus needs access to zarcillo for job generation, I will include it." 
[puppet] - 10https://gerrit.wikimedia.org/r/817181 (https://phabricator.wikimedia.org/T146149) (owner: 10Jcrespo) [10:06:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P31974 and previous config saved to /var/cache/conftool/dbconfig/20220727-100607-marostegui.json [10:07:59] (03CR) 10Filippo Giunchedi: "The result is sth like the following:" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [10:08:50] 10SRE, 10serviceops, 10serviceops-collab: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (10LSobanski) [10:09:04] (03CR) 10Jbond: [C: 03+1] aptrepo: update gitlab-ce & gitlab-runner to 15.0 [puppet] - 10https://gerrit.wikimedia.org/r/817211 (https://phabricator.wikimedia.org/T309062) (owner: 10Jelto) [10:09:21] (03Merged) 10jenkins-bot: ml-services: Add eu, hu & hy wiki articletopic isvcs to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/817269 (https://phabricator.wikimedia.org/T313307) (owner: 10Kevin Bazira) [10:10:02] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:11:26] (03PS1) 10Marostegui: install_server: Do not reimage db217[0-3] [puppet] - 10https://gerrit.wikimedia.org/r/817741 (https://phabricator.wikimedia.org/T311493) [10:12:41] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db217[0-3] [puppet] - 10https://gerrit.wikimedia.org/r/817741 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [10:12:53] (03Abandoned) 10Marostegui: install_server: Do not reimage db217[0-3] [puppet] - 
10https://gerrit.wikimedia.org/r/817741 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [10:14:00] (03PS1) 10Marostegui: install_server: Do not reimage db217[0-3] [puppet] - 10https://gerrit.wikimedia.org/r/817742 (https://phabricator.wikimedia.org/T311493) [10:15:04] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db217[0-3] [puppet] - 10https://gerrit.wikimedia.org/r/817742 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [10:15:49] (03CR) 10Jbond: "I also need to run the vtc tests before merging this" [puppet] - 10https://gerrit.wikimedia.org/r/817299 (owner: 10Jbond) [10:18:24] (03PS7) 10Jcrespo: db_inventory: Cleanup zarcillo database grants [puppet] - 10https://gerrit.wikimedia.org/r/817181 (https://phabricator.wikimedia.org/T146149) [10:21:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P31976 and previous config saved to /var/cache/conftool/dbconfig/20220727-102113-marostegui.json [10:21:21] (03PS1) 10Klausman: ml-k8s: add dummy secrects for articleoutlink [labs/private] - 10https://gerrit.wikimedia.org/r/817744 (https://phabricator.wikimedia.org/T313888) [10:27:17] /away/go 5 [10:27:29] woops, mb [10:28:14] (03PS4) 10Jbond: P:base::firewall: Add requestctl definitions to ferm [puppet] - 10https://gerrit.wikimedia.org/r/817307 (https://phabricator.wikimedia.org/T313825) [10:30:06] 10SRE, 10SRE-Access-Requests, 10User-Raymond_Ndibe: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (10Volans) @Raymond_Ndibe I've noticed that you currently have 4 different SSH keys in your Wikitech (LDAP) account, and the comments on the keys have different n... 
[10:32:49] (03PS1) 10Marostegui: db2072: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/817747 (https://phabricator.wikimedia.org/T311493) [10:33:29] (03CR) 10Marostegui: [C: 03+2] db2072: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/817747 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [10:34:16] (03PS3) 10MVernon: swift: drain older systems, bring some new ones online [puppet] - 10https://gerrit.wikimedia.org/r/817209 (https://phabricator.wikimedia.org/T294549) [10:36:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T312990)', diff saved to https://phabricator.wikimedia.org/P31978 and previous config saved to /var/cache/conftool/dbconfig/20220727-103619-marostegui.json [10:36:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [10:36:25] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [10:36:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [10:36:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1160 (T312990)', diff saved to https://phabricator.wikimedia.org/P31979 and previous config saved to /var/cache/conftool/dbconfig/20220727-103640-marostegui.json [10:37:21] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/817209 (https://phabricator.wikimedia.org/T294549) (owner: 10MVernon) [10:42:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T312990)', diff saved to https://phabricator.wikimedia.org/P31980 and previous config saved to /var/cache/conftool/dbconfig/20220727-104204-marostegui.json [10:42:11] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [10:42:53] (03CR) 10Filippo 
Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/817209 (https://phabricator.wikimedia.org/T294549) (owner: 10MVernon) [10:46:29] !log update cassandradev packages for stretch to 3.11.13 T313742 [10:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:33] T313742: Import Cassandra 3.11.13 as 'dev', Stretch - https://phabricator.wikimedia.org/T313742 [10:47:02] (03CR) 10Volans: "Reviewed current implementation, replied to comment with alternative approach." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [10:49:45] (03PS2) 10Klausman: ml-k8s: add dummy secrects for articleoutlink [labs/private] - 10https://gerrit.wikimedia.org/r/817744 (https://phabricator.wikimedia.org/T313888) [10:50:43] (03CR) 10MVernon: [C: 03+2] swift: drain older systems, bring some new ones online [puppet] - 10https://gerrit.wikimedia.org/r/817209 (https://phabricator.wikimedia.org/T294549) (owner: 10MVernon) [10:50:50] (03PS3) 10Klausman: ml-k8s: add dummy secrects for article-outlink [labs/private] - 10https://gerrit.wikimedia.org/r/817744 (https://phabricator.wikimedia.org/T313888) [10:51:04] (03CR) 10Klausman: [C: 03+2] ml-k8s: add dummy secrects for article-outlink [labs/private] - 10https://gerrit.wikimedia.org/r/817744 (https://phabricator.wikimedia.org/T313888) (owner: 10Klausman) [10:51:09] (03CR) 10Klausman: [V: 03+2 C: 03+2] ml-k8s: add dummy secrects for article-outlink [labs/private] - 10https://gerrit.wikimedia.org/r/817744 (https://phabricator.wikimedia.org/T313888) (owner: 10Klausman) [10:57:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P31981 and previous config saved to /var/cache/conftool/dbconfig/20220727-105710-marostegui.json [10:57:32] (03PS1) 10Klausman: ml k8s: Add article-outlink section to deployment server cfg [puppet] - 
10https://gerrit.wikimedia.org/r/817749 (https://phabricator.wikimedia.org/T313888) [11:04:52] (03CR) 10Jbond: [C: 03+1] Add configuration for the release script [software/pywmflib] - 10https://gerrit.wikimedia.org/r/817712 (owner: 10Volans) [11:05:08] (03CR) 10Jbond: [C: 03+1] Add configuration for the release script [software/spicerack] - 10https://gerrit.wikimedia.org/r/817717 (owner: 10Volans) [11:05:28] (03CR) 10Jbond: [C: 03+1] Add configuration for the release script [software/homer] - 10https://gerrit.wikimedia.org/r/817719 (owner: 10Volans) [11:06:42] (03CR) 10Filippo Giunchedi: "Thank you for the review!" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [11:07:38] (03CR) 10Volans: [C: 03+2] Add configuration for the release script [software/pywmflib] - 10https://gerrit.wikimedia.org/r/817712 (owner: 10Volans) [11:07:44] (03CR) 10Volans: [C: 03+2] Add configuration for the release script [software/spicerack] - 10https://gerrit.wikimedia.org/r/817717 (owner: 10Volans) [11:07:52] (03CR) 10Volans: [C: 03+2] Add configuration for the release script [software/homer] - 10https://gerrit.wikimedia.org/r/817719 (owner: 10Volans) [11:08:00] (03PS1) 10Klausman: ML k8s: fix articletopic-outlink names [labs/private] - 10https://gerrit.wikimedia.org/r/817750 (https://phabricator.wikimedia.org/T313888) [11:09:45] (03CR) 10Klausman: [V: 03+2 C: 03+2] ML k8s: fix articletopic-outlink names [labs/private] - 10https://gerrit.wikimedia.org/r/817750 (https://phabricator.wikimedia.org/T313888) (owner: 10Klausman) [11:11:26] (03Merged) 10jenkins-bot: Add configuration for the release script [software/pywmflib] - 10https://gerrit.wikimedia.org/r/817712 (owner: 10Volans) [11:12:01] (03Merged) 10jenkins-bot: Add configuration for the release script [software/homer] - 10https://gerrit.wikimedia.org/r/817719 (owner: 10Volans) [11:12:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch 
instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [11:12:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P31982 and previous config saved to /var/cache/conftool/dbconfig/20220727-111216-marostegui.json [11:13:07] (03PS1) 10Klausman: ML k8s: add configuration for articletopic-outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/817751 (https://phabricator.wikimedia.org/T313888) [11:14:16] (03Merged) 10jenkins-bot: Add configuration for the release script [software/spicerack] - 10https://gerrit.wikimedia.org/r/817717 (owner: 10Volans) [11:15:02] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:18] Emperor: ^^^ [11:16:48] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:19:51] (03CR) 10Vgutierrez: [C: 03+1] "looks good," [puppet] - 10https://gerrit.wikimedia.org/r/816206 (https://phabricator.wikimedia.org/T138093) (owner: 10Ori) [11:27:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T312990)', diff saved to https://phabricator.wikimedia.org/P31983 and previous config saved to /var/cache/conftool/dbconfig/20220727-112722-marostegui.json [11:27:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [11:27:32] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [11:27:38] !log 
marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [11:28:28] (03CR) 10Klausman: [C: 03+1] "LGTM. Maybe add a short note to the commit msg on why the predictor stanzas are deleted." [deployment-charts] - 10https://gerrit.wikimedia.org/r/817732 (https://phabricator.wikimedia.org/T311982) (owner: 10Elukey) [11:28:35] volans: thanks for the ping; that alert also crops up in #wikimedia-data-persistence [11:31:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [11:31:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [11:31:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T312990)', diff saved to https://phabricator.wikimedia.org/P31984 and previous config saved to /var/cache/conftool/dbconfig/20220727-113136-marostegui.json [11:35:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T312990)', diff saved to https://phabricator.wikimedia.org/P31985 and previous config saved to /var/cache/conftool/dbconfig/20220727-113557-marostegui.json [11:36:03] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [11:37:48] * kart_ updating cxserver.. 
[11:42:27] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2022-07-27-070728-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/817089 (https://phabricator.wikimedia.org/T313300) (owner: 10KartikMistry) [11:45:03] (03PS1) 10Marostegui: mariadb: Productionize db2174 [puppet] - 10https://gerrit.wikimedia.org/r/817755 (https://phabricator.wikimedia.org/T311493) [11:46:01] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2174 [puppet] - 10https://gerrit.wikimedia.org/r/817755 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [11:46:10] (03Merged) 10jenkins-bot: Update cxserver to 2022-07-27-070728-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/817089 (https://phabricator.wikimedia.org/T313300) (owner: 10KartikMistry) [11:46:18] Emperor: ack, sorry then, I was worried could go unnoticed here with all the rest [11:46:40] NP [11:47:53] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [11:48:26] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [11:50:58] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:51:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P31986 and previous config saved to /var/cache/conftool/dbconfig/20220727-115103-marostegui.json [11:51:52] /away [11:51:56] lolz [11:53:20] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [11:54:06] (03PS1) 10FNegri: Add fnegri's SSH key to authorized keys [labs/private] - 10https://gerrit.wikimedia.org/r/817757 (https://phabricator.wikimedia.org/T312597) 
[11:54:09] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [11:54:47] Grafana no longer displays restart/deploys? There is a switch, but seems not working. [11:56:52] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [11:57:41] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [12:00:50] !log Updated cxserver to 2022-07-27-070728-production (T313300, T309577, T310873, T310880) [12:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:58] T309577: Drop support for and use for inline media - https://phabricator.wikimedia.org/T309577 [12:00:58] T310873: Post-creation work for blkwiki - https://phabricator.wikimedia.org/T310873 [12:00:58] T313300: Enable Section Translation on 10 more Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T313300 [12:00:58] T310880: Post-creation work for pcmwiki - https://phabricator.wikimedia.org/T310880 [12:06:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P31987 and previous config saved to /var/cache/conftool/dbconfig/20220727-120609-marostegui.json [12:07:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:07:38] (03PS1) 10KartikMistry: Enable SectionTranslation on 10 more WPs where ContentTranslation is available by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817758 (https://phabricator.wikimedia.org/T313300) [12:10:37] RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:14:43] (03CR) 10DCausse: [WIP] Tune wikidata language selector autocomplete (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817317 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse) [12:15:17] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:17:05] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [12:17:34] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [12:19:21] (03PS1) 10Jelto: gitlab_runner: fix re-registration issues [puppet] - 10https://gerrit.wikimedia.org/r/817759 (https://phabricator.wikimedia.org/T311746) [12:21:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T312990)', diff saved to https://phabricator.wikimedia.org/P31988 and previous config saved to /var/cache/conftool/dbconfig/20220727-122115-marostegui.json [12:21:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [12:21:20] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [12:21:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [12:21:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T312990)', diff saved to https://phabricator.wikimedia.org/P31989 and previous config saved to /var/cache/conftool/dbconfig/20220727-122147-marostegui.json [12:23:54] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36426/console" [puppet] - 10https://gerrit.wikimedia.org/r/817759 (https://phabricator.wikimedia.org/T311746) (owner: 10Jelto) [12:29:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T312990)', diff saved to https://phabricator.wikimedia.org/P31990 and previous config saved to /var/cache/conftool/dbconfig/20220727-122920-marostegui.json [12:29:26] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [12:31:07] (03PS2) 10DCausse: Tune the wikidata "language" profile for wbsearchentities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817317 (https://phabricator.wikimedia.org/T307869) [12:32:13] (03CR) 10DCausse: Tune the wikidata "language" profile for wbsearchentities (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817317 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse) [12:36:53] (03PS1) 10Jaime Nuche: scap: enable target bootstrap in beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/817762 (https://phabricator.wikimedia.org/T303559) [12:36:57] (03PS5) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817221 [12:37:57] (03CR) 10Gehel: [C: 03+2] elastic: Restart masters one at a time after all others [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) (owner: 10Ebernhardson) [12:40:26] (03CR) 10Volans: "test nit inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) (owner: 10Ebernhardson) [12:42:24] (03CR) 10CI reject: [V: 04-1] O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817221 (owner: 10Jbond) [12:44:23] (03PS6) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817221 [12:44:27] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P31991 and previous config saved to /var/cache/conftool/dbconfig/20220727-124426-marostegui.json [12:46:05] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review, 10User-zeljkofilipin: +2 for pwangai - https://phabricator.wikimedia.org/T313794 (10pwangai) @Volans I need +2 for mediawiki/* [12:48:16] (03CR) 10CI reject: [V: 04-1] O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817221 (owner: 10Jbond) [12:56:02] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36429/console" [puppet] - 10https://gerrit.wikimedia.org/r/817221 (owner: 10Jbond) [12:57:13] (03PS7) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817221 [12:59:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P31992 and previous config saved to /var/cache/conftool/dbconfig/20220727-125933-marostegui.json [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220727T1300). [13:00:04] phuedx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:53] (03PS8) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817221 [13:01:34] I can deploy! 
[13:04:36] (03CR) 10CI reject: [V: 04-1] O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817221 (owner: 10Jbond) [13:05:40] (03PS9) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817221 [13:06:54] phuedx: the diffConfig output looks like the rate is still 0 on some Beta wikis, is that correct? [13:06:58] (03PS1) 10Clément Goubert: admin: add cgoubert to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/817767 (https://phabricator.wikimedia.org/T313902) [13:07:34] (my browser isn’t happy with the long output, so I piped https://integration.wikimedia.org/ci/job/operations-mw-config-php72-composer-diffConfig-docker/12559/timestamps/?time=HH:mm:ss&timeZone=GMT+2&appendLog&locale=en_US into less instead) [13:07:51] in production testwiki seems to be the only wiki with a 1 rate, which sounds right [13:10:11] (03CR) 10CI reject: [V: 04-1] O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817221 (owner: 10Jbond) [13:10:37] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36431/console" [puppet] - 10https://gerrit.wikimedia.org/r/817221 (owner: 10Jbond) [13:10:39] RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:10:47] phuedx: are you there? 
[13:10:56] (03CR) 10Elukey: [C: 03+1] ml k8s: Add article-outlink section to deployment server cfg [puppet] - 10https://gerrit.wikimedia.org/r/817749 (https://phabricator.wikimedia.org/T313888) (owner: 10Klausman) [13:11:52] (03CR) 10Klausman: [C: 03+2] ml k8s: Add article-outlink section to deployment server cfg [puppet] - 10https://gerrit.wikimedia.org/r/817749 (https://phabricator.wikimedia.org/T313888) (owner: 10Klausman) [13:12:43] (03CR) 10Elukey: [C: 03+1] ML k8s: add configuration for articletopic-outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/817751 (https://phabricator.wikimedia.org/T313888) (owner: 10Klausman) [13:14:28] (03PS2) 10Elukey: ml-services: move prod docker images to KServe 0.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/817732 (https://phabricator.wikimedia.org/T311982) [13:14:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T312990)', diff saved to https://phabricator.wikimedia.org/P31993 and previous config saved to /var/cache/conftool/dbconfig/20220727-131439-marostegui.json [13:14:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [13:14:45] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [13:14:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [13:15:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T312990)', diff saved to https://phabricator.wikimedia.org/P31994 and previous config saved to /var/cache/conftool/dbconfig/20220727-131500-marostegui.json [13:16:58] (03PS3) 10Elukey: ml-services: move prod docker images to KServe 0.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/817732 (https://phabricator.wikimedia.org/T311982) [13:17:01] (03CR) 10Klausman: [C: 03+2] ML k8s: add configuration for 
articletopic-outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/817751 (https://phabricator.wikimedia.org/T313888) (owner: 10Klausman) [13:18:11] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:19:39] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:20:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T312990)', diff saved to https://phabricator.wikimedia.org/P31995 and previous config saved to /var/cache/conftool/dbconfig/20220727-132005-marostegui.json [13:20:12] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [13:21:11] RECOVERY - Host ores1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [13:21:17] (03CR) 10Elukey: [C: 03+2] ml-services: move prod docker images to KServe 0.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/817732 (https://phabricator.wikimedia.org/T311982) (owner: 10Elukey) [13:23:42] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[13:25:30] (03PS1) 10Xcollazo: airflow - Modify platform_eng instance to do deployment of airflow-dags [puppet] - 10https://gerrit.wikimedia.org/r/817774 (https://phabricator.wikimedia.org/T312858) [13:25:37] I’ll test https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/817317 on mwdebug1001 in the meantime [13:25:47] (03PS1) 10Klausman: deployment-server: fix ML model name for articletopic-outlink [puppet] - 10https://gerrit.wikimedia.org/r/817775 (https://phabricator.wikimedia.org/T313888) [13:26:07] (03PS1) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817777 (https://phabricator.wikimedia.org/T313910) [13:26:28] (03CR) 10Klausman: [V: 03+2 C: 03+2] deployment-server: fix ML model name for articletopic-outlink [puppet] - 10https://gerrit.wikimedia.org/r/817775 (https://phabricator.wikimedia.org/T313888) (owner: 10Klausman) [13:27:09] 7 [13:27:39] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [13:28:43] (03CR) 10Lucas Werkmeister (WMDE): "I was actually going to ask if this change would make the negative boost for disambiguations etc. ineffective if they occurred together wi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817317 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse) [13:30:10] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [13:31:59] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:32:21] (03CR) 10DCausse: Tune the wikidata "language" profile for wbsearchentities (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817317 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse) [13:32:36] !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[13:32:46] !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:32:47] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:34:04] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [13:34:14] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [13:34:19] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.304 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:34:24] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [13:34:27] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [13:34:28] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [13:34:56] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Tested it on mwdebug1001, seems to work great. Swiss German no longer beats standard German, but can still readily be found using its labe" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817317 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse) [13:35:05] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48391 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:35:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P31997 and previous config saved to /var/cache/conftool/dbconfig/20220727-133511-marostegui.json [13:35:15] dcausse: should I deploy the language tuning right now? 
[13:35:23] since we have a deploy window at the moment :) [13:36:00] (in the meantime, I’m done testing on mwdebug1001 and wiped my changes using scap pull) [13:36:22] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [13:38:58] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: puppet avoid use of reduce functions - https://phabricator.wikimedia.org/T313910 (10Peachey88) [13:39:27] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [13:39:36] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Tune the wikidata "language" profile for wbsearchentities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817317 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse) [13:39:39] let’s go :) [13:40:40] (03Merged) 10jenkins-bot: Tune the wikidata "language" profile for wbsearchentities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817317 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse) [13:41:03] Lucas_WMDE: sorry missed your ping, please go ahead :) [13:41:10] \o/ [13:41:24] probably sync IS.php first, then SearchSettingsForWikidata.php [13:41:40] yes [13:41:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:42:04] the two parts in SearchSettings look like they need each other but thankfully they’re in the same file [13:42:07] (03CR) 10Jelto: [C: 03+2] aptrepo: update gitlab-ce & gitlab-runner to 15.0 [puppet] - 10https://gerrit.wikimedia.org/r/817211 (https://phabricator.wikimedia.org/T309062) (owner: 10Jelto) [13:42:15] and we’re only making the profile available tomorrow anyways [13:42:47] tested on mwdebug1001, still 
seems to work [13:43:23] syncing [13:44:34] (03PS2) 10Eigyan: [config]: Add click event logging for mobile and desktop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812391 (https://phabricator.wikimedia.org/T310852) [13:45:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:46:47] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:817317|Tune the wikidata "language" profile for wbsearchentities (T307869)]] (1/2) (duration: 03m 29s) [13:46:52] T307869: Request for new search profile for Wikidata that boosts Items for languages - https://phabricator.wikimedia.org/T307869 [13:46:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:47:10] (03PS2) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817777 (https://phabricator.wikimedia.org/T313910) [13:49:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:49:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:50:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P31998 and previous config saved to /var/cache/conftool/dbconfig/20220727-135017-marostegui.json [13:50:21] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/SearchSettingsForWikidata.php: Config: [[gerrit:817317|Tune the wikidata "language" profile for wbsearchentities (T307869)]] (2/2) (duration: 03m 21s) [13:50:44] (03PS1) 10Klausman: ML k8s: move articletopic-outlink files to correct subdir [deployment-charts] - 10https://gerrit.wikimedia.org/r/817781 
(https://phabricator.wikimedia.org/T313888) [13:50:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:51:17] (03CR) 10Klausman: [C: 03+2] ML k8s: move articletopic-outlink files to correct subdir [deployment-charts] - 10https://gerrit.wikimedia.org/r/817781 (https://phabricator.wikimedia.org/T313888) (owner: 10Klausman) [13:51:30] !log UTC afternoon backport+config window done [13:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:36] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36433/console" [puppet] - 10https://gerrit.wikimedia.org/r/817777 (https://phabricator.wikimedia.org/T313910) (owner: 10Jbond) [13:52:18] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Marostegui) Removing the DBA tag as this will only affect A7 and we don't have any DBs there. [13:52:27] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Marostegui) [13:54:05] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:56:35] (03Merged) 10jenkins-bot: ML k8s: move articletopic-outlink files to correct subdir [deployment-charts] - 10https://gerrit.wikimedia.org/r/817781 (https://phabricator.wikimedia.org/T313888) (owner: 10Klausman) [13:59:03] (03PS1) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817783 (https://phabricator.wikimedia.org/T313910) [14:01:58] (03PS1) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 
10https://gerrit.wikimedia.org/r/817784 (https://phabricator.wikimedia.org/T313910) [14:02:14] (03CR) 10Ssingh: [V: 03+1 C: 03+2] trafficserver: 9.x upgrade: remove redundant metrics [puppet] - 10https://gerrit.wikimedia.org/r/803297 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [14:02:36] (03PS2) 10MVernon: swift: stop flinging thumbnails at other DC in rewrite.py [puppet] - 10https://gerrit.wikimedia.org/r/816726 (https://phabricator.wikimedia.org/T313102) [14:03:31] /7 [14:03:35] uff sorry :) [14:04:14] (03CR) 10Andrew Bogott: [C: 03+2] P:openstack::cinder: use new rabbitmq_hosts hiera var [puppet] - 10https://gerrit.wikimedia.org/r/815681 (owner: 10Majavah) [14:05:05] (03CR) 10MVernon: swift: stop flinging thumbnails at other DC in rewrite.py (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/816726 (https://phabricator.wikimedia.org/T313102) (owner: 10MVernon) [14:05:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T312990)', diff saved to https://phabricator.wikimedia.org/P31999 and previous config saved to /var/cache/conftool/dbconfig/20220727-140523-marostegui.json [14:05:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [14:05:29] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [14:05:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [14:05:43] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36434/console" [puppet] - 10https://gerrit.wikimedia.org/r/817783 (https://phabricator.wikimedia.org/T313910) (owner: 10Jbond) [14:05:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T312990)', diff saved to https://phabricator.wikimedia.org/P32000 and previous 
config saved to /var/cache/conftool/dbconfig/20220727-140544-marostegui.json [14:06:44] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36435/console" [puppet] - 10https://gerrit.wikimedia.org/r/817784 (https://phabricator.wikimedia.org/T313910) (owner: 10Jbond) [14:10:23] RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T312990)', diff saved to https://phabricator.wikimedia.org/P32001 and previous config saved to /var/cache/conftool/dbconfig/20220727-141108-marostegui.json [14:11:12] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [14:13:46] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36437/console" [puppet] - 10https://gerrit.wikimedia.org/r/817784 (https://phabricator.wikimedia.org/T313910) (owner: 10Jbond) [14:15:31] (03PS1) 10Clare Ming: Disable sticky header edit A/B test for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817785 (https://phabricator.wikimedia.org/T312296) [14:16:35] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/recommendation-api: apply [14:16:55] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/recommendation-api: apply [14:20:33] (03CR) 10Andrew Bogott: [C: 04-1] "I'd like the two pieces of this decoupled: first let's get the new nodes puppetized, running, and clustered before we actually tell openst" [puppet] - 10https://gerrit.wikimedia.org/r/816818 (owner: 10Majavah) [14:20:49] (03PS2) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817784 (https://phabricator.wikimedia.org/T313910) 
[14:21:29] (03PS4) 10Ssingh: trafficserver: 9.x upgrade: replace client.verify.server [puppet] - 10https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651) [14:22:13] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36439/console" [puppet] - 10https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [14:22:38] (03CR) 10Giuseppe Lavagetto: [C: 03+2] admin: add cgoubert to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/817767 (https://phabricator.wikimedia.org/T313902) (owner: 10Clément Goubert) [14:22:50] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet [14:22:51] (03CR) 10Andrew Bogott: [C: 04-1] "Oh, and actually, once the new nodes are ready to receive traffic we can move the service rather than adding the additional nodes... will " [puppet] - 10https://gerrit.wikimedia.org/r/816818 (owner: 10Majavah) [14:23:14] !log jbond@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts sretest1002.eqiad.wmnet [14:23:52] (03CR) 10Andrew Bogott: [C: 03+2] P:openstack::designate: use new rabbitmq_hosts hiera var [puppet] - 10https://gerrit.wikimedia.org/r/815683 (owner: 10Majavah) [14:26:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P32002 and previous config saved to /var/cache/conftool/dbconfig/20220727-142614-marostegui.json [14:26:36] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36440/console" [puppet] - 10https://gerrit.wikimedia.org/r/817784 (https://phabricator.wikimedia.org/T313910) (owner: 10Jbond) [14:27:30] (03PS3) 10Nskaggs: Expand retry logic for cinder backups [puppet] - 10https://gerrit.wikimedia.org/r/817378 
(https://phabricator.wikimedia.org/T310640) [14:27:53] (03PS3) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817784 (https://phabricator.wikimedia.org/T313910) [14:28:13] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [14:28:30] (03PS2) 10Eevans: Do not assign CASSANDRA_LOG_DIR from environment config [puppet] - 10https://gerrit.wikimedia.org/r/816805 (https://phabricator.wikimedia.org/T309896) [14:29:10] (03PS1) 10Jbond: do not merge! [puppet] - 10https://gerrit.wikimedia.org/r/817788 [14:30:37] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [14:31:26] 10SRE, 10ops-codfw, 10Machine-Learning-Team: codfw: ml-serve2001 memmory issue DIMM A2 - https://phabricator.wikimedia.org/T313822 (10Papaul) The reboot fixed the DIMM error for now: ` The self-heal operation successfully completed at DIMM DIMM_A2. Wed 27 Jul 2022 09:06:24 The self-heal operation succes... 
[14:32:59] (03CR) 10Andrew Bogott: "This isn't a concern with this patch exactly, but I'm now thinking that existing trove VMs won't know to switch over to the new rabbit ser" [puppet] - 10https://gerrit.wikimedia.org/r/815723 (owner: 10Majavah) [14:33:07] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:33:11] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:33:39] (03PS4) 10Nskaggs: Expand retry logic for cinder backups [puppet] - 10https://gerrit.wikimedia.org/r/817378 (https://phabricator.wikimedia.org/T310640) [14:34:21] (03PS10) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817221 [14:34:23] (03PS1) 10Jbond: P:sretest: import blackbox to sretest to check if its just genrally slow [puppet] - 10https://gerrit.wikimedia.org/r/817789 [14:34:33] 10SRE, 10SRE-Access-Requests: Requesting access to WMCS for Slavina Stefanova - https://phabricator.wikimedia.org/T313934 (10Slst2020) [14:35:07] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36441/console" [puppet] - 10https://gerrit.wikimedia.org/r/817784 (https://phabricator.wikimedia.org/T313910) (owner: 10Jbond) [14:36:38] 10SRE, 10ops-codfw: Recycling Pickup for CODFW - https://phabricator.wikimedia.org/T307694 (10Papaul) We received the pre-audit yesterday. The list of server we sent matches the pre-audit. 
[14:38:50] (03CR) 10CI reject: [V: 04-1] O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817221 (owner: 10Jbond) [14:41:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P32003 and previous config saved to /var/cache/conftool/dbconfig/20220727-144120-marostegui.json [14:41:38] (03PS5) 10Nskaggs: Expand retry logic for cinder backups [puppet] - 10https://gerrit.wikimedia.org/r/817378 (https://phabricator.wikimedia.org/T310640) [14:43:27] 10SRE, 10SRE-Access-Requests: Requesting access to WMCS for Slavina Stefanova - https://phabricator.wikimedia.org/T313934 (10nskaggs) +1 [14:46:52] 10SRE, 10SRE-Access-Requests, 10User-Raymond_Ndibe: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (10nskaggs) +1, I approve. @Volans The contract is intended to be ongoing and does not have a defined end date (granted however, the current contract is only for... [14:48:12] 10SRE, 10SRE-Access-Requests, 10User-Raymond_Ndibe: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (10nskaggs) [14:48:56] 10SRE, 10SRE-Access-Requests, 10User-Raymond_Ndibe: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (10nskaggs) @Raymond_Ndibe @Volans I edited the task to reflect the correct LDAP groups. [14:51:17] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [14:51:39] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[14:53:16] (03CR) 10Nskaggs: Expand retry logic for cinder backups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/817378 (https://phabricator.wikimedia.org/T310640) (owner: 10Nskaggs) [14:56:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T312990)', diff saved to https://phabricator.wikimedia.org/P32005 and previous config saved to /var/cache/conftool/dbconfig/20220727-145626-marostegui.json [14:56:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [14:56:31] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [14:56:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [14:56:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T312990)', diff saved to https://phabricator.wikimedia.org/P32006 and previous config saved to /var/cache/conftool/dbconfig/20220727-145646-marostegui.json [15:02:27] (03PS1) 10AikoChou: ml-services: Add outlink-topic-model isvc to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/817793 (https://phabricator.wikimedia.org/T313888) [15:04:20] 10SRE, 10SRE-Access-Requests, 10User-Raymond_Ndibe: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (10Volans) >>! In T313876#8108545, @nskaggs wrote: > +1, I approve. > > @Volans The contract is intended to be ongoing and does not have a defined end date (gran... 
[15:06:26] (03PS22) 10Jbond: puppet-merge: add Repository class [puppet] - 10https://gerrit.wikimedia.org/r/544943 (https://phabricator.wikimedia.org/T254249) [15:10:05] (03PS2) 10Jbond: P:sretest: import blackbox to sretest to check if its just genrally slow [puppet] - 10https://gerrit.wikimedia.org/r/817789 [15:20:32] (03PS8) 10Andrea Denisse: Add role::netmon to the netmon1003 instance. [puppet] - 10https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) [15:21:40] PROBLEM - Zookeeper Server #page on conf1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper [15:22:03] this is me ^ [15:22:08] I'll fix [15:22:10] hello [15:22:14] (03PS1) 10MVernon: hieradata: move all of sessionstore to 3.11.13 [puppet] - 10https://gerrit.wikimedia.org/r/817798 (https://phabricator.wikimedia.org/T309896) [15:22:29] thanks akosiaris, I have ACKed the page [15:22:40] it should recover shortly [15:24:12] RECOVERY - Zookeeper Server #page on conf1005 is OK: PROCS OK: 1 process with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper [15:24:37] 10SRE, 10SRE-Access-Requests, 10User-Raymond_Ndibe: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (10jbond) @Raymond_Ndibe i noticed the following in the above request > Preferred shell username: raymond Are you able to point me to the form, template or link... 
[15:25:59] (03PS3) 10Jbond: P:sretest: import blackbox to sretest to check if its just genrally slow [puppet] - 10https://gerrit.wikimedia.org/r/817789 [15:27:32] (03CR) 10Elukey: [C: 03+2] ml-services: Add outlink-topic-model isvc to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/817793 (https://phabricator.wikimedia.org/T313888) (owner: 10AikoChou) [15:28:55] RECOVERY - Disk space on aqs1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=aqs1004&var-datasource=eqiad+prometheus/ops [15:29:04] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Switch etcd clients to use conf100[789] [puppet] - 10https://gerrit.wikimedia.org/r/817264 (https://phabricator.wikimedia.org/T311408) (owner: 10Alexandros Kosiaris) [15:31:17] Lucas_WMDE: My sincere apologies! Something came up and I had to step away from my computer. I should have updated the deployment calendar as I stepped away but forgot. Again, my apologies. I'll reschedule the deployment to reflect reality [15:31:47] phuedx: no problem, it didn’t cost me much ;) [15:32:00] maybe see you tomorrow, then ^^ [15:32:17] It cost me some karma! [15:32:23] oh no D: [15:32:43] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to phab1001/phab2001 for Daniel Duvall (dduvall) - https://phabricator.wikimedia.org/T313831 (10thcipriani) >>! In T313831#8107223, @Volans wrote: > @thcipriani: your approval both as group approver and manager is required here ;) > I'm prep... [15:32:56] Sure. I'll schedule it for tomorrow. 
I'll also look at what's going on with those Beta wikis in the meantime [15:33:01] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to phab1001/phab2001 for Daniel Duvall (dduvall) - https://phabricator.wikimedia.org/T313831 (10thcipriani) [15:33:26] (03CR) 10Volans: "approved on task" [puppet] - 10https://gerrit.wikimedia.org/r/817707 (https://phabricator.wikimedia.org/T313831) (owner: 10Volans) [15:34:05] (03PS2) 10Volans: admin: add dduvall to phabricator-roots [puppet] - 10https://gerrit.wikimedia.org/r/817707 (https://phabricator.wikimedia.org/T313831) [15:34:56] (03CR) 10Vgutierrez: [C: 03+1] Switch etcd clients to use conf100[789] [puppet] - 10https://gerrit.wikimedia.org/r/817264 (https://phabricator.wikimedia.org/T311408) (owner: 10Alexandros Kosiaris) [15:35:44] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/817707 (https://phabricator.wikimedia.org/T313831) (owner: 10Volans) [15:38:41] (03PS11) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817221 [15:38:43] (03PS3) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817777 (https://phabricator.wikimedia.org/T313910) [15:38:45] (03PS2) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817783 (https://phabricator.wikimedia.org/T313910) [15:38:47] (03PS4) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817784 (https://phabricator.wikimedia.org/T313910) [15:39:06] (03CR) 10MVernon: [C: 03+2] Do not assign CASSANDRA_LOG_DIR from environment config [puppet] - 10https://gerrit.wikimedia.org/r/816805 (https://phabricator.wikimedia.org/T309896) (owner: 10Eevans) [15:39:35] (03CR) 10Volans: [C: 03+2] admin: add dduvall to phabricator-roots [puppet] - 10https://gerrit.wikimedia.org/r/817707 (https://phabricator.wikimedia.org/T313831) (owner: 10Volans) [15:40:08] Emperor: if you got my change 
too in your puppet-merge feel free to merge it [15:40:41] volans: no, just mine [15:40:55] ack, doing mine then [15:40:55] thx [15:40:56] (03PS1) 10Alexandros Kosiaris: conf100[456]: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/817806 (https://phabricator.wikimedia.org/T311407) [15:42:02] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to phab1001/phab2001 for Daniel Duvall (dduvall) - https://phabricator.wikimedia.org/T313831 (10Volans) [15:43:19] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to phab1001/phab2001 for Daniel Duvall (dduvall) - https://phabricator.wikimedia.org/T313831 (10Volans) 05Open→03Resolved a:03Volans @dduvall all set, patch merged in puppet, will be reflected in the fleet within ~30 minutes. I'm resol... [15:43:39] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36442/console" [puppet] - 10https://gerrit.wikimedia.org/r/817784 (https://phabricator.wikimedia.org/T313910) (owner: 10Jbond) [15:44:02] (03CR) 10Alexandros Kosiaris: [C: 03+2] conf100[456]: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/817806 (https://phabricator.wikimedia.org/T311407) (owner: 10Alexandros Kosiaris) [15:45:46] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to phab1001/phab2001 for Daniel Duvall (dduvall) - https://phabricator.wikimedia.org/T313831 (10dduvall) Thank you very much! [15:46:08] !log restarting Cassandra, sessionstore2001, to restore on-disk logging -- T309896 [15:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:14] T309896: Upgrade Cassandra to latest 3.x (3.11.13) - https://phabricator.wikimedia.org/T309896 [15:48:08] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[15:51:38] !log rolling Cassandra restart, aqs2001-2012, to restore on-disk logging -- T309896 [15:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:44] T309896: Upgrade Cassandra to latest 3.x (3.11.13) - https://phabricator.wikimedia.org/T309896 [15:54:11] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [15:54:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T312990)', diff saved to https://phabricator.wikimedia.org/P32007 and previous config saved to /var/cache/conftool/dbconfig/20220727-155417-marostegui.json [15:54:22] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [15:57:02] (03PS1) 10Dzahn: phabricator: add phabricator-roots on new phabricator hardware [puppet] - 10https://gerrit.wikimedia.org/r/817811 (https://phabricator.wikimedia.org/T280597) [15:58:18] (03CR) 10JHathaway: [C: 03+1] "looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/814810 (https://phabricator.wikimedia.org/T313217) (owner: 10Jbond) [15:58:36] (03CR) 10Dzahn: "well, I will also add the role to the new hosts asap.. just have to double check it doesn't add duplicate IPs or something.. 
so this is no" [puppet] - 10https://gerrit.wikimedia.org/r/817811 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [16:03:07] (03CR) 10BCornwall: [C: 03+2] Icinga: Remove traffic alerts [puppet] - 10https://gerrit.wikimedia.org/r/814894 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [16:03:13] (03PS2) 10BCornwall: Icinga: Remove traffic alerts [puppet] - 10https://gerrit.wikimedia.org/r/814894 (https://phabricator.wikimedia.org/T300723) [16:05:05] (03CR) 10Andrew Bogott: [C: 03+1] Add fnegri's SSH key to authorized keys [labs/private] - 10https://gerrit.wikimedia.org/r/817757 (https://phabricator.wikimedia.org/T312597) (owner: 10FNegri) [16:05:13] 10SRE, 10Observability-Alerting, 10Traffic, 10Patch-For-Review, 10User-fgiunchedi: Migrate Traffic Prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T300723 (10BCornwall) 05In progress→03Resolved [16:07:05] (03PS2) 10FNegri: Add fnegri's SSH key to authorized keys [labs/private] - 10https://gerrit.wikimedia.org/r/817757 (https://phabricator.wikimedia.org/T312597) [16:07:20] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet [16:07:49] !log jbond@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts sretest1002.eqiad.wmnet [16:08:02] (03CR) 10FNegri: [V: 03+2] Add fnegri's SSH key to authorized keys [labs/private] - 10https://gerrit.wikimedia.org/r/817757 (https://phabricator.wikimedia.org/T312597) (owner: 10FNegri) [16:08:07] (03CR) 10FNegri: [V: 03+2 C: 03+2] Add fnegri's SSH key to authorized keys [labs/private] - 10https://gerrit.wikimedia.org/r/817757 (https://phabricator.wikimedia.org/T312597) (owner: 10FNegri) [16:09:21] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:09:23] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P32008 and previous config saved to /var/cache/conftool/dbconfig/20220727-160923-marostegui.json [16:10:15] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet [16:22:00] (03PS4) 10Majavah: P:openstack::trove: use new rabbitmq_hosts hiera var [puppet] - 10https://gerrit.wikimedia.org/r/815723 [16:22:02] (03PS2) 10Majavah: hieradata: switch traffic to cloudrabbit1001-3 [puppet] - 10https://gerrit.wikimedia.org/r/816818 [16:22:04] (03PS1) 10Majavah: site: install rabbit on cloudrabbit1001-3 [puppet] - 10https://gerrit.wikimedia.org/r/817817 [16:22:27] !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [16:23:47] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36443/console" [puppet] - 10https://gerrit.wikimedia.org/r/817817 (owner: 10Majavah) [16:24:17] (03CR) 10CI reject: [V: 04-1] site: install rabbit on cloudrabbit1001-3 [puppet] - 10https://gerrit.wikimedia.org/r/817817 (owner: 10Majavah) [16:24:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P32009 and previous config saved to /var/cache/conftool/dbconfig/20220727-162429-marostegui.json [16:26:48] (03CR) 10Majavah: [V: 03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/817817 (owner: 10Majavah) [16:31:05] !log this is a sample log, demonstrating to dhinus [16:31:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:10] !log rolling Cassandra restart, aqs1010-1015, to restore on-disk logging -- T309896 [16:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:13] T309896: Upgrade Cassandra to latest 3.x (3.11.13) - https://phabricator.wikimedia.org/T309896 [16:31:32] 
(03CR) 10Majavah: hieradata: switch traffic to cloudrabbit1001-3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/816818 (owner: 10Majavah) [16:32:07] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [16:32:08] !log jbond@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts sretest1002.eqiad.wmnet [16:34:24] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet [16:34:41] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:39:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T312990)', diff saved to https://phabricator.wikimedia.org/P32010 and previous config saved to /var/cache/conftool/dbconfig/20220727-163935-marostegui.json [16:39:40] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [16:39:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance [16:39:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance [16:39:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 13 hosts with reason: Maintenance [16:40:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 13 hosts with reason: Maintenance [16:40:26] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/816726 (https://phabricator.wikimedia.org/T313102) (owner: 10MVernon) [16:40:38] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [16:42:24] !log jbond@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts sretest1002.eqiad.wmnet [16:43:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [16:44:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [16:44:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T312990)', diff saved to https://phabricator.wikimedia.org/P32011 and previous config saved to /var/cache/conftool/dbconfig/20220727-164425-marostegui.json [16:51:31] (03CR) 10Jdlrobson: [C: 03+1] Disable sticky header edit A/B test for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817785 (https://phabricator.wikimedia.org/T312296) (owner: 10Clare Ming) [16:52:39] PROBLEM - MegaRAID on ms-be2067 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:52:40] ACKNOWLEDGEMENT - MegaRAID on ms-be2067 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T313952 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:52:47] 10SRE, 10ops-codfw: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T313952 (10ops-monitoring-bot) [16:59:33] PROBLEM - Disk space on ms-be2067 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdc1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2067&var-datasource=codfw+prometheus/ops [17:01:01] PROBLEM - Check systemd state on ms-be2067 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T312990)', diff saved to https://phabricator.wikimedia.org/P32012 and previous config saved to /var/cache/conftool/dbconfig/20220727-170856-marostegui.json [17:09:02] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [17:10:09] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [17:10:47] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:12:29] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Swift [17:15:45] (03CR) 10Dzahn: gitlab: add reserved service IP 208.80.154.8, point to replica-new (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn) [17:16:36] (03CR) 10Dzahn: gitlab: add reserved service IP 208.80.154.8, point to replica-new (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn) [17:20:39] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:21:16] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@1a72195]: switch image_suggestions_manual from _delta to _full [17:21:54] (03PS2) 10Dzahn: gitlab: add reserved service IP 208.80.154.8, point to replica-new 
[dns] - 10https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142) [17:22:34] (03PS3) 10Dzahn: gitlab: add reserved service IP 208.80.153.8, point to replica-new [dns] - 10https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142) [17:23:17] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@1a72195]: switch image_suggestions_manual from _delta to _full (duration: 02m 01s) [17:23:27] (03CR) 10CI reject: [V: 04-1] gitlab: add reserved service IP 208.80.153.8, point to replica-new [dns] - 10https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn) [17:24:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P32013 and previous config saved to /var/cache/conftool/dbconfig/20220727-172402-marostegui.json [17:24:44] (03PS4) 10Dzahn: gitlab: add reserved service IP 208.80.154.8, point to replica-new [dns] - 10https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142) [17:26:43] (03PS5) 10Dzahn: gitlab: add reserved service IP 208.80.153.8, point to replica-new [dns] - 10https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142) [17:26:47] (03CR) 10Dzahn: gitlab: add reserved service IP 208.80.153.8, point to replica-new (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn) [17:28:24] (03PS1) 10Dzahn: admin: add demon to gerrit-root [puppet] - 10https://gerrit.wikimedia.org/r/817838 [17:30:37] (03CR) 10Dzahn: "I was first just looking at gerrit-deployers group because it came up in the standup today. 
Then saw how that includes gerrit-roots and fr" [puppet] - 10https://gerrit.wikimedia.org/r/817838 (owner: 10Dzahn) [17:31:59] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:32:58] (03PS1) 10RobH: updating sku list for new r440 sku [software] - 10https://gerrit.wikimedia.org/r/817839 [17:33:23] (03CR) 10RobH: [C: 03+2] updating sku list for new r440 sku [software] - 10https://gerrit.wikimedia.org/r/817839 (owner: 10RobH) [17:34:04] (03Merged) 10jenkins-bot: updating sku list for new r440 sku [software] - 10https://gerrit.wikimedia.org/r/817839 (owner: 10RobH) [17:39:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P32014 and previous config saved to /var/cache/conftool/dbconfig/20220727-173908-marostegui.json [17:40:02] (03PS2) 10Jcrespo: Initial commit [software/pampinus] - 10https://gerrit.wikimedia.org/r/817294 (https://phabricator.wikimedia.org/T283017) [17:40:43] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install kafka-logging200[45] - https://phabricator.wikimedia.org/T313959 (10RobH) [17:41:02] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install kafka-logging200[45] - https://phabricator.wikimedia.org/T313959 (10RobH) [17:44:06] (03PS1) 10Dzahn: gerrit: add gerrit2002 to list of migration dest hosts [puppet] - 10https://gerrit.wikimedia.org/r/817841 (https://phabricator.wikimedia.org/T313250) [17:44:33] (PuppetFailure) firing: Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:45:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10RobH) [17:45:22] (03CR) 10Dzahn: "CCers: this is all 
stuff we did in the past to make migrations easier, yay:" [puppet] - 10https://gerrit.wikimedia.org/r/817841 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [17:45:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10RobH) [17:46:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10RobH) @herron, The ordering task lacked racking details, but since we had all the info for the codfw kafka-logging order already, I was able to figure out most of them.... [17:49:33] (PuppetFailure) firing: (2) Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:50:49] Error while evaluating a Resource Statement, Unknown resource type: 'monitoring::alerts::traffic_drop' (file: /etc/puppet/modules/profile/manifests/prometheus/alerts.pp, line: 195, column: 5) on node alert1001.wikimedia.org [17:54:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T312990)', diff saved to https://phabricator.wikimedia.org/P32015 and previous config saved to /var/cache/conftool/dbconfig/20220727-175414-marostegui.json [17:54:20] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [17:56:49] (03PS1) 10Volans: admin: add raymond-ndibe user and to WMCS groups [puppet] - 10https://gerrit.wikimedia.org/r/817843 (https://phabricator.wikimedia.org/T313876) [17:57:27] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-Raymond_Ndibe: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (10Volans) [17:57:55] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-Raymond_Ndibe: Requesting access to cloud-roots for Raymond 
Ndibe - https://phabricator.wikimedia.org/T313876 (10Volans) I've uploaded the patch, but will wait on @Raymond_Ndibe replies before proceeding. [17:59:25] (03PS1) 10Zabe: prometheus: remove traffic_drop alerts [puppet] - 10https://gerrit.wikimedia.org/r/817844 (https://phabricator.wikimedia.org/T300723) [17:59:50] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: cr2-eqiad:FPC3 partial failure (PIC2/3) - https://phabricator.wikimedia.org/T312745 (10wiki_willy) RMA shipped out by Chris on Tuesday, July 26 >>! In T312745#8088364, @Cmjohnson wrote: > Replaced the line card, and placed the old one in the same... [18:00:04] brennen and jeena: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220727T1800). [18:00:04] brennen and jeena: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220727T1800). 
[18:00:45] (03CR) 10BCornwall: [C: 03+1] prometheus: remove traffic_drop alerts [puppet] - 10https://gerrit.wikimedia.org/r/817844 (https://phabricator.wikimedia.org/T300723) (owner: 10Zabe) [18:00:56] (03CR) 10BCornwall: [C: 03+2] prometheus: remove traffic_drop alerts [puppet] - 10https://gerrit.wikimedia.org/r/817844 (https://phabricator.wikimedia.org/T300723) (owner: 10Zabe) [18:06:25] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:07:21] 10SRE, 10SRE-Access-Requests: Requesting access to WMCS for Slavina Stefanova - https://phabricator.wikimedia.org/T313934 (10Volans) [18:07:28] o/ - currently blocking on T313836 [18:07:29] T313836: MediaWiki\Extension\Translate\TtmServer\ServiceCreationFailure: Unknown type for name 'Apertium': cxserver - https://phabricator.wikimedia.org/T313836 [18:08:24] 10SRE, 10SRE-Access-Requests: Requesting access to WMCS for Slavina Stefanova - https://phabricator.wikimedia.org/T313934 (10Volans) p:05Triage→03Medium [18:08:59] (03PS1) 10Volans: admin: add sstefanova user and to WMCS groups [puppet] - 10https://gerrit.wikimedia.org/r/817845 (https://phabricator.wikimedia.org/T313934) [18:09:14] (03Abandoned) 10Majavah: discovery: switchover doc to doc1002 [dns] - 10https://gerrit.wikimedia.org/r/744762 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah) [18:10:36] (03PS1) 10BCornwall: Remove more traffic alerts [puppet] - 10https://gerrit.wikimedia.org/r/817866 (https://phabricator.wikimedia.org/T300723) [18:15:03] brennen, jeena: posting a change shortly that'll decom mw2254, currently a scap proxy -- but just for clarity, it won't be going anywhere yet, you have the conch [18:15:33] (03PS1) 10RLazarus: Decom 
mw2251-2255,2257,2258 [puppet] - 10https://gerrit.wikimedia.org/r/817869 (https://phabricator.wikimedia.org/T313730) [18:16:17] Thanks rzl [18:16:55] (03CR) 10RLazarus: "Posting now but will merge later today, after depooling and running decom cookbook" [puppet] - 10https://gerrit.wikimedia.org/r/817869 (https://phabricator.wikimedia.org/T313730) (owner: 10RLazarus) [18:17:25] 10SRE, 10SRE-tools, 10Infrastructure-Foundations: Migrate get-raid-status-megacli to Python3 - https://phabricator.wikimedia.org/T313952 (10Volans) p:05Triage→03Medium [18:22:47] (03PS1) 10Jcrespo: dbbackups: Increase es dumps retention from 8 days to 10 days [puppet] - 10https://gerrit.wikimedia.org/r/817871 (https://phabricator.wikimedia.org/T138562) [18:22:57] (03PS2) 10Jcrespo: dbbackups: Increase es dumps retention from 8 days to 10 days [puppet] - 10https://gerrit.wikimedia.org/r/817871 (https://phabricator.wikimedia.org/T138562) [18:23:21] (03PS2) 10Dzahn: gerrit: turn gerrit2002 into a gerrit migration dest host [puppet] - 10https://gerrit.wikimedia.org/r/817841 (https://phabricator.wikimedia.org/T313250) [18:24:07] (03CR) 10Jcrespo: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/817871 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [18:25:39] 10SRE, 10ops-codfw, 10DC-Ops, 10masz, 10serviceops: Q1:rack/setup/install new memcached hosts - https://phabricator.wikimedia.org/T313963 (10RobH) [18:25:50] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install new memcached hosts - https://phabricator.wikimedia.org/T313963 (10RobH) [18:25:59] (03CR) 10Jcrespo: [C: 03+2] db_inventory: Cleanup zarcillo database grants [puppet] - 10https://gerrit.wikimedia.org/r/817181 (https://phabricator.wikimedia.org/T146149) (owner: 10Jcrespo) [18:26:35] (03CR) 10Jcrespo: [C: 03+2] "wrong patch to +2" [puppet] - 10https://gerrit.wikimedia.org/r/817181 (https://phabricator.wikimedia.org/T146149) (owner: 10Jcrespo) [18:26:47] (03CR) 10Jcrespo: "recheck" [puppet] - 
10https://gerrit.wikimedia.org/r/817181 (https://phabricator.wikimedia.org/T146149) (owner: 10Jcrespo) [18:27:07] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Increase es dumps retention from 8 days to 10 days [puppet] - 10https://gerrit.wikimedia.org/r/817871 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [18:28:03] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install new memcached hosts - https://phabricator.wikimedia.org/T313963 (10RobH) @joe, Reassigning this to you per our IRC discussion. Pending needs from you/#serviceops: * Please populate racking details section with hostname, OS and all other f... [18:28:41] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install new memcached hosts - https://phabricator.wikimedia.org/T313963 (10RobH) [18:28:55] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313963 (10RobH) [18:29:25] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:29:56] 10SRE, 10serviceops: codfw (2) memcached host service implementation tracking - https://phabricator.wikimedia.org/T313965 (10RobH) [18:31:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313963 (10RobH) [18:31:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (10RobH) [18:32:10] 10SRE, 10serviceops: eqiad (2) memcached host service implementation tracking - https://phabricator.wikimedia.org/T313965 (10RobH) [18:33:12] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (10RobH) [18:33:39] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install new codfw memcached hosts - 
https://phabricator.wikimedia.org/T313966 (10RobH) @joe, Reassigning this to you per our IRC discussion. Pending needs from you/#serviceops: * Please populate racking details section with hostname, OS and all o... [18:35:07] 10SRE, 10serviceops: codfw (2) memcached host service implementation tracking - https://phabricator.wikimedia.org/T313968 (10RobH) [18:35:27] 10SRE, 10serviceops: codfw (2) memcached host service implementation tracking - https://phabricator.wikimedia.org/T313968 (10RobH) [18:36:10] (03CR) 10Eevans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/817798 (https://phabricator.wikimedia.org/T309896) (owner: 10MVernon) [18:36:33] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (10RobH) [18:36:47] 10SRE, 10serviceops: codfw (2) memcached host service implementation tracking - https://phabricator.wikimedia.org/T313968 (10RobH) [18:37:12] (03PS9) 10Andrea Denisse: Add role::netmon to the netmon1003 instance. [puppet] - 10https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) [18:37:44] (03CR) 10Dzahn: [V: 04-1] "https://puppet-compiler.wmflabs.org/pcc-worker1001/36447/gerrit2002.wikimedia.org/change.gerrit2002.wikimedia.org.err" [puppet] - 10https://gerrit.wikimedia.org/r/817841 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [18:47:18] (03CR) 10Cwhite: [C: 03+1] "Looks good, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [18:48:40] (03CR) 10Andrea Denisse: [C: 03+2] Add role::netmon to the netmon1003 instance. 
[puppet] - https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse) [18:50:18] (PS1) Andrew Bogott: wikimediacloud.org: add cname records for rabbitmq [dns] - https://gerrit.wikimedia.org/r/817877 [18:50:20] (PS1) Andrew Bogott: wikimediacloud.org: use cloudcontrol1006 as the primary openstack endpoint [dns] - https://gerrit.wikimedia.org/r/817878 (https://phabricator.wikimedia.org/T313268) [18:51:17] (CR) Andrew Bogott: "Assuming we merge https://gerrit.wikimedia.org/r/c/operations/dns/+/817877, this will need updating to use the service addresses." [puppet] - https://gerrit.wikimedia.org/r/816818 (owner: Majavah) [18:51:41] (CR) CI reject: [V: -1] wikimediacloud.org: add cname records for rabbitmq [dns] - https://gerrit.wikimedia.org/r/817877 (owner: Andrew Bogott) [18:51:43] (CR) CI reject: [V: -1] wikimediacloud.org: use cloudcontrol1006 as the primary openstack endpoint [dns] - https://gerrit.wikimedia.org/r/817878 (https://phabricator.wikimedia.org/T313268) (owner: Andrew Bogott) [18:53:43] SRE, ops-eqiad, Infrastructure-Foundations, netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul) [18:54:39] (CR) Andrew Bogott: [C: +2] P:openstack::trove: use new rabbitmq_hosts hiera var [puppet] - https://gerrit.wikimedia.org/r/815723 (owner: Majavah) [18:56:44] SRE, ops-eqiad, Infrastructure-Foundations, netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul) @ayounsi I moved asw2-c and asw2-d uplink from 0/0 and 0/1 to 1/0 and 1/3 on both router to match codfw. In the future if we have to change row A and ro... 
[18:58:33] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [19:00:59] (PS1) Andrea Denisse: netmon: Add to Acme-chief's Hieradata. [puppet] - https://gerrit.wikimedia.org/r/817883 [19:01:15] (CR) Andrew Bogott: [C: +2] site: install rabbit on cloudrabbit1001-3 [puppet] - https://gerrit.wikimedia.org/r/817817 (owner: Majavah) [19:01:22] (PS2) Andrew Bogott: site: install rabbit on cloudrabbit1001-3 [puppet] - https://gerrit.wikimedia.org/r/817817 (owner: Majavah) [19:02:23] (CR) Cwhite: [C: +1] netmon: Add to Acme-chief's Hieradata. [puppet] - https://gerrit.wikimedia.org/r/817883 (owner: Andrea Denisse) [19:02:45] (CR) Andrea Denisse: [C: +2] netmon: Add to Acme-chief's Hieradata. [puppet] - https://gerrit.wikimedia.org/r/817883 (owner: Andrea Denisse) [19:09:13] (PS2) Andrew Bogott: wikimediacloud.org: add cname records for rabbitmq [dns] - https://gerrit.wikimedia.org/r/817877 [19:10:29] SRE, ops-eqiad, decommission-hardware: decommission frdb1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T313607 (Jgreen) >>! In T313607#8103921, @Volans wrote: > @Jgreen All this for us is managed by the `sre.hosts.decommission` cookbook, that we can't run for your hosts. > I think you sho... 
[19:11:07] (PS1) Andrea Denisse: netmon: Add netmon instances to Acme-chief's Hieradata for Smokeping [puppet] - https://gerrit.wikimedia.org/r/817887 (https://phabricator.wikimedia.org/T309074) [19:11:27] (CR) Cwhite: [C: +1] netmon: Add netmon instances to Acme-chief's Hieradata for Smokeping [puppet] - https://gerrit.wikimedia.org/r/817887 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse) [19:11:41] (CR) Andrea Denisse: [C: +2] netmon: Add netmon instances to Acme-chief's Hieradata for Smokeping [puppet] - https://gerrit.wikimedia.org/r/817887 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse) [19:11:48] (CR) Andrea Denisse: [V: +2 C: +2] netmon: Add netmon instances to Acme-chief's Hieradata for Smokeping [puppet] - https://gerrit.wikimedia.org/r/817887 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse) [19:13:50] (CR) Majavah: "Thoughts about not using numerical identifiers and going with, say, rabbitmq-a/b/c instead to reduce any possible confusion about the numb" [dns] - https://gerrit.wikimedia.org/r/817877 (owner: Andrew Bogott) [19:14:24] (PS2) Andrew Bogott: wikimediacloud.org: use cloudcontrol1006 as the primary openstack endpoint [dns] - https://gerrit.wikimedia.org/r/817878 (https://phabricator.wikimedia.org/T313268) [19:15:27] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:17:58] (CR) Dzahn: [C: +1] "lgtm, the compiler output is a bit special but that is because currently the puppet run on icinga is broken and this change fixes it again" [puppet] - https://gerrit.wikimedia.org/r/817866 
(https://phabricator.wikimedia.org/T300723) (owner: BCornwall) [19:19:07] (PS4) BCornwall: Remove kafka alerting class [puppet] - https://gerrit.wikimedia.org/r/817866 (https://phabricator.wikimedia.org/T300723) [19:20:57] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:22:17] (CR) BCornwall: [C: +2] Remove kafka alerting class [puppet] - https://gerrit.wikimedia.org/r/817866 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall) [19:26:56] SRE, LDAP-Access-Requests, Patch-For-Review, User-zeljkofilipin: +2 for pwangai - https://phabricator.wikimedia.org/T313794 (pwangai) Open→Resolved a:pwangai I am closing this request to follow instructions stipulated at https://www.mediawiki.org/wiki/Gerrit/Privilege_policy/en#Reques... [19:26:57] (PS1) Andrea Denisse: netmon: Add netmon1003 to scap's hieradata. [puppet] - https://gerrit.wikimedia.org/r/817890 (https://phabricator.wikimedia.org/T309074) [19:26:59] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:29:28] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q1:rack/setup/install db119[67] - https://phabricator.wikimedia.org/T313978 (RobH) [19:29:38] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q1:rack/setup/install db119[67] - https://phabricator.wikimedia.org/T313978 (RobH) [19:29:48] (CR) Cwhite: [C: +1] netmon: Add netmon1003 to scap's hieradata. [puppet] - https://gerrit.wikimedia.org/r/817890 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse) [19:29:57] (CR) Andrea Denisse: [C: +2] netmon: Add netmon1003 to scap's hieradata. 
[puppet] - https://gerrit.wikimedia.org/r/817890 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse) [19:32:33] SRE, ops-codfw, DC-Ops, Data-Persistence-Backup: Q1:rack/setup/install db218[34] - https://phabricator.wikimedia.org/T313979 (RobH) [19:33:20] (CR) Andrew Bogott: [C: +2] wikimediacloud.org: use cloudcontrol1006 as the primary openstack endpoint [dns] - https://gerrit.wikimedia.org/r/817878 (https://phabricator.wikimedia.org/T313268) (owner: Andrew Bogott) [19:33:36] SRE, ops-codfw, DC-Ops, Data-Persistence-Backup: Q1:rack/setup/install db218[34] - https://phabricator.wikimedia.org/T313979 (RobH) [19:34:09] !log denisse@deploy1002 Started deploy [librenms/librenms@f049593]: Provision LibreNMS on netmon1003 [19:34:14] !log denisse@deploy1002 Finished deploy [librenms/librenms@f049593]: Provision LibreNMS on netmon1003 (duration: 00m 05s) [19:44:34] SRE, ops-eqiad, Infrastructure-Foundations, netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul) @Jclark-ctr when you have time, can you plug on : ### CR1-eqiad to front of the patch panel port 0/1: 1 breakout cable and the first break out cable go... [19:45:03] (PuppetFailure) resolved: (2) Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:47:29] (PS1) Bartosz Dziewoński: VisualEditor: Allow external link paste on mediawikiwiki, metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/817893 (https://phabricator.wikimedia.org/T129546) [19:48:18] (CR) Dzahn: [C: +1] "I can confirm these are the servers from the decom ticket, it matches netbox, this change looks correct. 
You will have to ignore the warni" [puppet] - https://gerrit.wikimedia.org/r/817869 (https://phabricator.wikimedia.org/T313730) (owner: RLazarus) [19:49:05] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:51:51] ACKNOWLEDGEMENT - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating Andrew Bogott This may be due to my moving the openstack.eqiad1.wikimediacloud.org endpoint, investigating. https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:53:45] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (RobH) [19:53:56] (PS1) Bartosz Dziewoński: jquery.textSelection: Support more edge cases of document.execCommand [core] (wmf/1.39.0-wmf.22) - https://gerrit.wikimedia.org/r/817847 (https://phabricator.wikimedia.org/T33780) [19:54:04] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (RobH) [19:54:37] (PS1) Bartosz Dziewoński: jquery.textSelection: Use non-execCommand when we can't focus the field [core] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/817848 (https://phabricator.wikimedia.org/T33780) [19:55:08] (PS1) Bartosz Dziewoński: jquery.textSelection: Use non-execCommand when we can't focus the field [core] (wmf/1.39.0-wmf.22) - https://gerrit.wikimedia.org/r/817849 (https://phabricator.wikimedia.org/T33780) [19:56:52] (PS1) Bartosz Dziewoński: Delay template insertion until after closing the dialog [extensions/TemplateWizard] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/817850 
(https://phabricator.wikimedia.org/T33780) [19:57:01] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:57:02] (PS1) Bartosz Dziewoński: Delay template insertion until after closing the dialog [extensions/TemplateWizard] (wmf/1.39.0-wmf.22) - https://gerrit.wikimedia.org/r/817851 (https://phabricator.wikimedia.org/T33780) [19:58:31] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:00:05] RoanKattouw, Urbanecm, and cjming: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220727T2000). [20:00:05] koi and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:14] hi [20:00:47] (PS14) Ebernhardson: elastic: Restart masters one at a time after all others [software/spicerack] - https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) [20:01:11] hello [20:01:19] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:04:25] jouncebot: now [20:04:26] For the next 0 hour(s) and 55 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220727T2000) [20:05:42] hi - i can deploy [20:05:49] sorry for being late [20:05:59] (PS2) Clare Ming: ptwiki: Restrict "move" permission [mediawiki-config] - https://gerrit.wikimedia.org/r/817373 (https://phabricator.wikimedia.org/T313802) (owner: Stang) [20:07:03] (CR) Clare Ming: [C: +2] ptwiki: Restrict "move" permission [mediawiki-config] - https://gerrit.wikimedia.org/r/817373 (https://phabricator.wikimedia.org/T313802) (owner: Stang) [20:07:31] (CR) CI reject: [V: -1] elastic: Restart masters one at a time after all others [software/spicerack] - https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) (owner: Ebernhardson) [20:08:39] (Merged) jenkins-bot: ptwiki: Restrict "move" permission [mediawiki-config] - https://gerrit.wikimedia.org/r/817373 (https://phabricator.wikimedia.org/T313802) (owner: Stang) [20:09:21] koi: ur patch is up on mwdebug1002 - can you check? [20:10:04] (PS2) Clare Ming: VisualEditor: Allow external link paste on mediawikiwiki, metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/817893 (https://phabricator.wikimedia.org/T129546) (owner: Bartosz Dziewoński) [20:10:12] cjming: tested and LGTM [20:10:19] syncing! 
[20:11:52] (CR) Clare Ming: [C: +2] VisualEditor: Allow external link paste on mediawikiwiki, metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/817893 (https://phabricator.wikimedia.org/T129546) (owner: Bartosz Dziewoński) [20:12:43] cjming: thanks for deploying. my backports can be synced in any order, or all at once. feel free to +2 them ahead of time because i have a lot of them :( [20:13:04] MatmaRex: sounds good [20:13:50] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:817373|ptwiki: Restrict "move" permission (T313802)]] (duration: 03m 19s) [20:13:54] T313802: Modify 'move' permissions on ptwiki - https://phabricator.wikimedia.org/T313802 [20:14:05] (Merged) jenkins-bot: VisualEditor: Allow external link paste on mediawikiwiki, metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/817893 (https://phabricator.wikimedia.org/T129546) (owner: Bartosz Dziewoński) [20:14:37] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:15:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:15:15] MatmaRex: 1st config patch on mwdebug1002 if you can test [20:15:33] yeah. 
looking [20:16:23] (PS15) Ebernhardson: elastic: Restart masters one at a time after all others [software/spicerack] - https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) [20:16:25] (PS1) Brennen Bearnes: SearchTranslationsApi: Change the way we fetch TTM services [extensions/Translate] (wmf/1.39.0-wmf.22) - https://gerrit.wikimedia.org/r/817855 (https://phabricator.wikimedia.org/T313836) [20:16:36] (CR) Clare Ming: [C: +2] jquery.textSelection: Support more edge cases of document.execCommand [core] (wmf/1.39.0-wmf.22) - https://gerrit.wikimedia.org/r/817847 (https://phabricator.wikimedia.org/T33780) (owner: Bartosz Dziewoński) [20:16:46] cjming: looks good [20:16:54] (CR) Clare Ming: [C: +2] jquery.textSelection: Use non-execCommand when we can't focus the field [core] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/817848 (https://phabricator.wikimedia.org/T33780) (owner: Bartosz Dziewoński) [20:17:05] great - syncing config patch [20:17:26] cjming: mind pinging me when you're wrapping up? i'll probably have a backport to go out then and roll the train forward to group1. [20:17:38] brennen: sure thing [20:17:43] thx! [20:17:44] (PS3) Dzahn: gerrit: turn gerrit2002 into a gerrit migration dest host [puppet] - https://gerrit.wikimedia.org/r/817841 (https://phabricator.wikimedia.org/T313250) [20:19:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:19:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:20:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:20:35] (PS1) Andrew Bogott: OpenStack nova: pregenerate db grants for cell 0 [puppet] - https://gerrit.wikimedia.org/r/817896 [20:20:43] cjming: actually, if possible, could we do all of the backports at once? 
they all affect the same feature so it doesn't make much sense to test each individually [20:20:55] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:817893|VisualEditor: Allow external link paste on mediawikiwiki, metawiki (T129546)]] (duration: 03m 37s) [20:21:00] T129546: Support preserving external links in pasted HTML content - https://phabricator.wikimedia.org/T129546 [20:21:24] (CR) CI reject: [V: -1] OpenStack nova: pregenerate db grants for cell 0 [puppet] - https://gerrit.wikimedia.org/r/817896 (owner: Andrew Bogott) [20:22:17] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:22:18] MatmaRex: i think so - i'm unclear if i have to wait for things to merge on the release branches before rebasing [20:23:12] cjming: i just created them, so they should merge cleanly without the need for rebasing [20:23:47] MatmaRex: so you're saying just merge them all at once [20:23:59] yeah [20:24:07] alrighty [20:24:19] (CR) Clare Ming: [C: +2] jquery.textSelection: Use non-execCommand when we can't focus the field [core] (wmf/1.39.0-wmf.22) - https://gerrit.wikimedia.org/r/817849 (https://phabricator.wikimedia.org/T33780) (owner: Bartosz Dziewoński) [20:24:24] if you +2 them all, they all should start the tests now, and merge whenever that finishes [20:24:28] (PS2) Andrew Bogott: OpenStack nova: pregenerate db grants for cell 0 [puppet] - https://gerrit.wikimedia.org/r/817896 [20:24:48] yup [20:24:53] (CR) Clare Ming: [C: +2] Delay template insertion until after closing the dialog [extensions/TemplateWizard] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/817850 (https://phabricator.wikimedia.org/T33780) (owner: Bartosz Dziewoński) [20:25:02] (CR) Clare Ming: [C: +2] Delay template insertion until after closing the dialog [extensions/TemplateWizard] (wmf/1.39.0-wmf.22) - 
https://gerrit.wikimedia.org/r/817851 (https://phabricator.wikimedia.org/T33780) (owner: Bartosz Dziewoński) [20:25:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:26:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:26:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:27:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:27:31] (CR) CI reject: [V: -1] elastic: Restart masters one at a time after all others [software/spicerack] - https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) (owner: Ebernhardson) [20:28:47] (CR) Andrew Bogott: [C: +2] OpenStack nova: pregenerate db grants for cell 0 [puppet] - https://gerrit.wikimedia.org/r/817896 (owner: Andrew Bogott) [20:32:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:32:17] (PS1) Andrew Bogott: Openstack Nova: typo fix for cell db name [puppet] - https://gerrit.wikimedia.org/r/817899 [20:34:59] (Merged) jenkins-bot: jquery.textSelection: Support more edge cases of document.execCommand [core] (wmf/1.39.0-wmf.22) - https://gerrit.wikimedia.org/r/817847 (https://phabricator.wikimedia.org/T33780) (owner: Bartosz Dziewoński) [20:35:39] (Merged) jenkins-bot: jquery.textSelection: Use non-execCommand when we can't focus the field [core] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/817848 (https://phabricator.wikimedia.org/T33780) (owner: Bartosz Dziewoński) [20:35:40] (CR) Andrew Bogott: [C: +2] Openstack Nova: typo fix for cell db name [puppet] - 
https://gerrit.wikimedia.org/r/817899 (owner: Andrew Bogott) [20:37:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:37:08] (PS2) Zabe: wikimedia.org: Add developers.wikimedia.org alias [dns] - https://gerrit.wikimedia.org/r/816174 (https://phabricator.wikimedia.org/T313597) [20:38:07] (PS16) Ebernhardson: elastic: Restart masters one at a time after all others [software/spicerack] - https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) [20:39:32] MatmaRex: 817847 + 817848 are up on mwdebug1002 if you want to check since we're still waiting for your other stuff to merge [20:40:08] (PS17) Ebernhardson: elastic: Restart masters one at a time after all others [software/spicerack] - https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) [20:40:18] (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:40:40] thanks, i'd have to recheck with the other patches anyway, so i'll just wait if that's okay? 
[20:40:47] np [20:41:17] (PHPFPMTooBusy) firing: Not enough idle php7.2-fpm.service workers for Mediawiki api_appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:41:29] PROBLEM - Apache HTTP on mw1422 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:41:29] PROBLEM - Apache HTTP on mw1421 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:41:29] PROBLEM - Apache HTTP on mw1363 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:41:29] PROBLEM - Apache HTTP on mw1447 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:41:34] (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [20:41:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [20:41:35] mhm [20:41:39] PROBLEM - Apache HTTP on mw1428 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:41:39] acked [20:41:41] here [20:41:43] PROBLEM - Apache HTTP on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:41:45] PROBLEM - Apache HTTP on mw1314 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:41:47] PROBLEM - Apache HTTP on mw1398 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:41:47] PROBLEM - Apache HTTP on mw1396 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:41:47] PROBLEM - Apache HTTP on mw1400 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:41:51] PROBLEM - Apache HTTP on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:41:51] PROBLEM - Apache HTTP on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:41:51] PROBLEM - Apache HTTP on mw1374 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:41:51] PROBLEM - Apache HTTP on mw1379 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:41:51] PROBLEM - Apache HTTP on mw1406 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:41:55] PROBLEM - Apache HTTP on mw1443 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:41:55] PROBLEM - Apache HTTP on mw1444 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:41:57] here, trying to ACK alerts but it's many [20:41:57] PROBLEM - Apache HTTP on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:41:57] PROBLEM - Apache HTTP on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers [20:41:57] PROBLEM - Apache HTTP on mw1425 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:41:57] PROBLEM - Apache HTTP on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:41:57] PROBLEM - Apache HTTP on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:41:58] PROBLEM - Apache HTTP on mw1402 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:41:59] PROBLEM - Apache HTTP on mw1404 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:41:59] PROBLEM - Apache HTTP on mw1412 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:41:59] PROBLEM - Apache HTTP on mw1427 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:42:00] PROBLEM - Apache HTTP on mw1424 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:42:00] PROBLEM - Apache HTTP on mw1408 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:42:01] PROBLEM - Apache HTTP on mw1386 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:42:04] this is the middle of deployment? [20:42:07] cjming: may want to hold your deploy? 
[20:42:13] gah [20:42:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:42:16] ok [20:42:23] mutante: yes [20:42:25] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POS [20:42:30] ah hm [20:42:33] PROBLEM - Apache HTTP on mw1361 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:42:33] PROBLEM - Apache HTTP on mw1362 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:42:33] PROBLEM - Apache HTTP on mw1357 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:42:33] PROBLEM - Apache HTTP on mw1375 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:42:39] PROBLEM - Apache HTTP on mw1426 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:42:41] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:42:41] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 1189 gt 100 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:42:43] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [20:42:43] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not fou [20:42:43] nonexistent title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [20:42:45] PROBLEM - Apache HTTP on mw1360 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:42:45] PROBLEM - Apache HTTP on mw1381 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:42:47] PROBLEM - Apache HTTP on mw1382 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:42:47] PROBLEM - Apache HTTP on mw1358 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:42:47] PROBLEM - Apache HTTP on mw1390 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:42:47] sukhe, mutante: maybe move to -sre, less noise [20:42:49] PROBLEM - Apache HTTP on mw1392 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:42:51] PROBLEM - 
MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 1941 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:42:51] PROBLEM - Apache HTTP on mw1449 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:42:53] right thanks [20:42:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [20:43:03] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:43:03] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adap [20:43:03] nks to target language wiki. 
returned the unexpected status 500 (expecting: 200): /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [20:43:05] PROBLEM - Apache HTTP on mw1376 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:43:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:43:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:43:13] PROBLEM - Apache HTTP on mw1342 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:43:13] PROBLEM - Apache HTTP on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:43:13] PROBLEM - Apache HTTP on mw1380 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:43:15] PROBLEM - Apache HTTP on mw1394 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:43:17] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received: /v2/suggest/source/{title}/{to} ( [20:43:17] a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [20:43:17] PROBLEM - Apache HTTP on mw1450 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:43:17] PROBLEM - Apache HTTP on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:43:17] PROBLEM - Apache HTTP on mw1317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:43:18] can someone lmk if/when it's ok to continue? [20:43:21] PROBLEM - Apache HTTP on mw1388 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:43:21] PROBLEM - Apache HTTP on mw1383 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:43:23] PROBLEM - Apache HTTP on mw1448 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:43:25] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:43:29] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:43:29] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was 
received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:43:33] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/description/{title} (Get description for test page) timed out before a response was received: /{domain}/v1/page/media-list/{title} (Get media list from test page) is CRITICAL: Test Get media list from test page returned the unexpected status 503 (exp [20:43:33] 200): /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpe [20:43:33] tus 503 (expecting: 200): /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITICAL: Test Get preview mobile HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [20:43:36] cjming: join #wikimedia-sre [20:43:37] PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [20:43:39] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [20:43:41] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: 
Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:43:43] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for Apr [20:43:43] 016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{dom [20:43:43] page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [20:43:45] (03Merged) 10jenkins-bot: jquery.textSelection: Use non-execCommand when we can't focus the field [core] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/817849 (https://phabricator.wikimedia.org/T33780) (owner: 10Bartosz Dziewoński) [20:43:47] (03Merged) 10jenkins-bot: Delay template insertion until after closing the dialog [extensions/TemplateWizard] (wmf/1.39.0-wmf.21) - 10https://gerrit.wikimedia.org/r/817850 (https://phabricator.wikimedia.org/T33780) (owner: 10Bartosz Dziewoński) [20:43:49] (03Merged) 10jenkins-bot: Delay template insertion until after closing the dialog [extensions/TemplateWizard] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/817851 
(https://phabricator.wikimedia.org/T33780) (owner: 10Bartosz Dziewoński) [20:43:51] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:43:51] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:43:51] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:43:59] PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:43:59] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:44:03] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - api-https_443: Servers mw1426.eqiad.wmnet, mw1346.eqiad.wmnet, mw1380.eqiad.wmnet, mw1447.eqiad.wmnet, mw1361.eqiad.wmnet, mw1374.eqiad.wmnet, mw1344.eqiad.wmnet, mw1428.eqiad.wmnet, mw1348.eqiad.wmnet, mw1314.eqiad.wmnet, mw1412.eqiad.wmnet, 
mw1378.eqiad.wmnet, mw1404.eqiad.wmnet, mw1362.eqiad.wmnet, mw1381.eqiad.wmnet, mw1388.eqiad.wmnet, mw1340.eq [20:44:03] t, mw1449.eqiad.wmnet, mw1443.eqiad.wmnet, mw1343.eqiad.wmnet, mw1421.eqiad.wmnet, mw1347.eqiad.wmnet, mw1377.eqiad.wmnet, mw1345.eqiad.wmnet, mw1396.eqiad.wmnet, mw1358.eqiad.wmnet, mw1424.eqiad.wmnet, mw1398.eqiad.wmnet, mw1444.eqiad.wmnet, mw1376.eqiad.wmnet, mw1359.eqiad.wmnet, mw1423.eqiad.wmnet, mw1317.eqiad.wmnet, mw1450.eqiad.wmnet, mw1425.eqiad.wmnet, mw1408.eqiad.wmnet, mw1316.eqiad.wmnet, mw1379.eqiad.wmnet, mw1312.eqiad.wmnet, [20:44:03] eqiad.wmnet, mw1383.eqiad.wmnet, mw1427.eqiad.wmnet, mw1406.eqiad.wmnet, mw1375.eqiad.wmnet, mw1342.eqiad.wmnet, mw1382.eqiad.wmnet, mw1315.eqiad.wmnet, mw1341.eqiad.wmnet, mw1402.eqiad https://wikitech.wikimedia.org/wiki/PyBal [20:44:05] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [20:44:07] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:44:07] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200): [20:44:07] }/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 
(with aggregated=true)) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpecte [20:44:07] 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [20:44:07] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not fou [20:44:07] nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Proton [20:44:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:44:11] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:44:29] PROBLEM - Apache HTTP on mw1423 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:44:35] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - api-https_443: Servers mw1422.eqiad.wmnet, mw1426.eqiad.wmnet, mw1346.eqiad.wmnet, mw1380.eqiad.wmnet, mw1448.eqiad.wmnet, mw1447.eqiad.wmnet, mw1394.eqiad.wmnet, mw1361.eqiad.wmnet, mw1392.eqiad.wmnet, mw1374.eqiad.wmnet, mw1344.eqiad.wmnet, mw1428.eqiad.wmnet, mw1388.eqiad.wmnet, mw1314.eqiad.wmnet, mw1386.eqiad.wmnet, mw1348.eqiad.wmnet, mw1396.eq [20:44:35] t, mw1390.eqiad.wmnet, 
mw1381.eqiad.wmnet, mw1362.eqiad.wmnet, mw1340.eqiad.wmnet, mw1449.eqiad.wmnet, mw1443.eqiad.wmnet, mw1343.eqiad.wmnet, mw1421.eqiad.wmnet, mw1347.eqiad.wmnet, mw1377.eqiad.wmnet, mw1345.eqiad.wmnet, mw1358.eqiad.wmnet, mw1359.eqiad.wmnet, mw1424.eqiad.wmnet, mw1412.eqiad.wmnet, mw1398.eqiad.wmnet, mw1444.eqiad.wmnet, mw1404.eqiad.wmnet, mw1376.eqiad.wmnet, mw1363.eqiad.wmnet, mw1357.eqiad.wmnet, mw1423.eqiad.wmnet, [20:44:35] eqiad.wmnet, mw1450.eqiad.wmnet, mw1425.eqiad.wmnet, mw1408.eqiad.wmnet, mw1316.eqiad.wmnet, mw1379.eqiad.wmnet, mw1312.eqiad.wmnet, mw1400.eqiad.wmnet, mw1402.eqiad.wmnet, mw1383.eqiad https://wikitech.wikimedia.org/wiki/PyBal [20:44:39] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:44:39] PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:44:41] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:44:41] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:44:41] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: 
/en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:44:41] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:44:43] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:44:47] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [20:44:47] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/description/{title} (Get description for test page) is CRITICAL: Test Get description for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response [20:44:47] ived: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page [20:44:47] TICAL: Test Get preview mobile HTML for 
test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [20:44:47] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:44:53] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:44:53] PROBLEM - PHP7 rendering on mw1374 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:44:53] PROBLEM - PHP7 rendering on mw1357 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:44:57] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [20:44:57] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [20:44:59] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:44:59] PROBLEM - PHP7 rendering on mw1412 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:45:03] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [20:45:05] PROBLEM - Apache HTTP on mw1356 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:45:05] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:45:05] PROBLEM - restbase endpoints health on restbase2027 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:45:05] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:45:05] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:45:09] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned 
the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:45:09] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:45:09] PROBLEM - PHP7 rendering on mw1447 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:45:09] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [20:45:09] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:45:10] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:45:13] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:45:15] PROBLEM - PHP7 rendering on mw1356 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:45:15] PROBLEM - PHP7 rendering on mw1340 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:45:15] 
PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:45:17] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [20:45:17] PROBLEM - Apache HTTP on mw1345 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:45:17] PROBLEM - Apache HTTP on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:45:17] PROBLEM - Apache HTTP on mw1378 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:45:17] PROBLEM - Apache HTTP on mw1377 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:45:18] PROBLEM - Apache HTTP on mw1359 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:45:18] PROBLEM - PHP7 rendering on mw1386 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:45:19] PROBLEM - PHP7 rendering on mw1345 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:45:19] PROBLEM - PHP7 rendering on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:45:20] PROBLEM - PHP7 rendering on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:45:21] PROBLEM - PHP7 rendering on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:45:23] PROBLEM - PHP7 rendering on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:45:25] PROBLEM - PHP7 rendering on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:45:25] PROBLEM - PHP7 rendering on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:45:25] PROBLEM - PHP7 rendering on mw1392 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:45:25] PROBLEM - PHP7 rendering on mw1388 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:45:25] PROBLEM - PHP7 rendering on mw1379 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:45:26] PROBLEM - PHP7 rendering on mw1376 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:45:26] PROBLEM - PHP7 rendering on mw1449 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:45:29] PROBLEM - PHP7 rendering on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:45:31] (03CR) 10Andrew Bogott: wikimediacloud.org: add cname records for rabbitmq 
(031 comment) [dns] - 10https://gerrit.wikimedia.org/r/817877 (owner: 10Andrew Bogott) [20:45:31] PROBLEM - PHP7 rendering on mw1426 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:45:31] PROBLEM - PHP7 rendering on mw1424 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:45:31] PROBLEM - PHP7 rendering on mw1443 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:45:31] PROBLEM - PHP7 rendering on mw1342 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:45:33] PROBLEM - PHP7 rendering on mw1396 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:45:33] PROBLEM - PHP7 rendering on mw1381 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:45:41] PROBLEM - PHP7 rendering on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:45:45] PROBLEM - PHP7 rendering on mw1427 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:46:01] PROBLEM - PHP7 rendering on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:46:01] PROBLEM - PHP7 rendering on mw1317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:46:03] PROBLEM - PHP7 rendering on mw1390 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:46:09] PROBLEM - PHP7 rendering on mw1448 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:46:14] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:46:17] PROBLEM - PHP7 rendering on mw1382 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:46:21] PROBLEM - PHP7 rendering on mw1428 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:46:21] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:46:21] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:46:21] I got 503 problems [20:46:25] And a 503 is one [20:46:27] PROBLEM - PHP7 rendering on mw1400 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:46:27] PROBLEM - PHP7 rendering on mw1375 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:46:29] PROBLEM - PHP7 rendering on mw1404 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:46:29] PROBLEM - PHP7 rendering on mw1380 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:46:31] PROBLEM - PHP7 rendering on mw1359 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:46:35] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [20:46:37] (CR) Aaron Schulz: [C: +1] Multi-DC routing special cases for OAuth [puppet] - https://gerrit.wikimedia.org/r/817086 (https://phabricator.wikimedia.org/T313578) (owner: Tim Starling) [20:46:43] PROBLEM - PHP7 rendering on mw1398 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:46:43] PROBLEM - PHP7 rendering on mw1408 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:46:43] PROBLEM - PHP7 rendering on mw1423 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:46:43] PROBLEM - PHP7 rendering on mw1422 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:46:43] PROBLEM - PHP7 rendering on mw1402 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:46:45] PROBLEM - PHP7 rendering on mw1444 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:46:49] RECOVERY - Apache HTTP on mw1313 is OK: HTTP
OK: HTTP/1.1 302 Found - 547 bytes in 9.276 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:46:49] PROBLEM - PHP7 rendering on mw1358 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:46:49] PROBLEM - PHP7 rendering on mw1360 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:46:49] PROBLEM - PHP7 rendering on mw1361 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:46:49] PROBLEM - PHP7 rendering on mw1363 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:46:50] PROBLEM - PHP7 rendering on mw1406 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:46:50] PROBLEM - PHP7 rendering on mw1362 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:46:53] RECOVERY - Apache HTTP on mw1348 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 8.128 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:47:59] (CR) Bking: [V: +2 C: +2] Set CORS headers appropriate to WCQS [puppet] - https://gerrit.wikimedia.org/r/817387 (https://phabricator.wikimedia.org/T307391) (owner: Ebernhardson) [20:48:06] !log sukhe@cumin1001 dbctl commit (dc=all): 'depool db1132', diff saved to https://phabricator.wikimedia.org/P32017 and previous config saved to /var/cache/conftool/dbconfig/20220727-204806-sukhe.json [20:48:21] RECOVERY - Apache HTTP on mw1448 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 9.438 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:48:27]
RECOVERY - PHP7 rendering on mw1390 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 8.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:48:29] PROBLEM - Restbase edge drmrs on text-lb.drmrs.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [20:48:29] PROBLEM - PHP7 rendering on mw1383 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:48:31] PROBLEM - PHP7 rendering on mw1377 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:48:31] PROBLEM - PHP7 rendering on mw1394 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:48:31] RECOVERY - PHP7 rendering on mw1448 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 5.370 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:48:43] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [20:48:43] RECOVERY - PHP7 rendering on mw1428 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 6.408 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:48:45] PROBLEM - PHP7 rendering on mw1378 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering 
[20:48:49] RECOVERY - Apache HTTP on mw1447 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 3.678 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:48:49] RECOVERY - Apache HTTP on mw1421 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 3.963 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:48:49] RECOVERY - Apache HTTP on mw1422 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 4.686 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:48:49] RECOVERY - PHP7 rendering on mw1400 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 5.155 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:48:51] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:48:51] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:48:51] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:48:51] RECOVERY - Apache HTTP on mw1363 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 7.204 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:48:53] RECOVERY - PHP7 rendering on mw1404 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 7.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:48:53] RECOVERY - PHP7 rendering on mw1375 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 7.888 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:48:53] RECOVERY - PHP7 rendering on mw1359 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 5.529 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:48:55] RECOVERY - PHP7 rendering on mw1380 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 8.620 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:48:57] RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:48:57] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:48:57] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [20:48:57] RECOVERY - Apache HTTP on mw1428 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 3.958 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:48:57] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:49:01] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [20:49:01] RECOVERY - PHP7 rendering on mw1423 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.948 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:49:03] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [20:49:03] RECOVERY - PHP7 rendering on mw1402 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 2.122 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:49:03] RECOVERY - PHP7 rendering on mw1422 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 2.360 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering 
[20:49:03] RECOVERY - Apache HTTP on mw1312 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 5.842 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:49:03] RECOVERY - PHP7 rendering on mw1408 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 2.971 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:49:03] RECOVERY - Apache HTTP on mw1314 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 4.672 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:49:05] RECOVERY - PHP7 rendering on mw1398 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 3.780 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:49:05] RECOVERY - Apache HTTP on mw1396 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.469 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:49:05] RECOVERY - PHP7 rendering on mw1444 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 2.160 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:49:07] RECOVERY - PHP7 rendering on mw1358 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:49:07] RECOVERY - PHP7 rendering on mw1363 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:49:07] RECOVERY - Apache HTTP on mw1400 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 2.028 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:49:07] RECOVERY - PHP7 rendering on mw1361 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:49:07] RECOVERY - PHP7 rendering on mw1406 is OK: HTTP 
OK: HTTP/1.1 302 Found - 559 bytes in 0.086 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:49:08] RECOVERY - Apache HTTP on mw1398 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 2.577 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:49:08] RECOVERY - PHP7 rendering on mw1362 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.785 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:49:09] RECOVERY - Apache HTTP on mw1315 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.711 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:49:09] RECOVERY - Apache HTTP on mw1379 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:49:10] RECOVERY - Apache HTTP on mw1406 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:49:10] RECOVERY - Apache HTTP on mw1374 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 1.836 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:49:11] RECOVERY - PHP7 rendering on mw1360 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 3.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:49:13] RECOVERY - Apache HTTP on mw1443 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:49:13] RECOVERY - Apache HTTP on mw1444 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.694 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:49:13] RECOVERY - Apache HTTP on mw1425 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:49:13] RECOVERY - 
Apache HTTP on mw1344 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:49:15] RECOVERY - Apache HTTP on mw1346 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.349 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:49:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:49:15] RECOVERY - Apache HTTP on mw1402 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:49:15] RECOVERY - Apache HTTP on mw1343 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 1.020 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:49:17] RECOVERY - Apache HTTP on mw1408 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:49:17] RECOVERY - Apache HTTP on mw1424 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:49:17] RECOVERY - Apache HTTP on mw1423 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:49:17] RECOVERY - Apache HTTP on mw1386 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:49:17] RECOVERY - Apache HTTP on mw1404 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.505 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:49:17] RECOVERY - Apache HTTP on mw1412 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.835 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:49:18] RECOVERY - Apache HTTP on mw1427 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.852 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [20:49:19] (CR) Eevans: [C: +1] "Insofar as the impact this might have on storage, +1 from me." [deployment-charts] - https://gerrit.wikimedia.org/r/816059 (https://phabricator.wikimedia.org/T313496) (owner: Tim Starling) [20:49:33] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:49:33] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:49:33] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:49:33] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:49:33] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:49:34] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:49:34] RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:49:39] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:49:39] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [20:49:39] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [20:49:41] RECOVERY - PHP7 rendering on mw1357 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.044
second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:49:41] RECOVERY - PHP7 rendering on mw1374 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.150 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:49:45] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:49:47] RECOVERY - PHP7 rendering on mw1412 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.566 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:49:49] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [20:49:49] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [20:49:49] RECOVERY - Apache HTTP on mw1357 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:49:49] RECOVERY - Apache HTTP on mw1361 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:49:49] RECOVERY - Apache HTTP on mw1375 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:49:50] RECOVERY - Apache HTTP on mw1362 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:49:53] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:49:53] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:49:55] RECOVERY - Apache HTTP on mw1356 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.165 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:49:57] RECOVERY - Apache HTTP on mw1426 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:49:57] RECOVERY - PHP7 rendering on mw1447 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:49:57] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [20:50:01] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:50:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:50:03] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [20:50:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:50:03] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [20:50:03] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:50:03] RECOVERY - PHP7 rendering on mw1340 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:50:03] RECOVERY - 
PHP7 rendering on mw1356 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:50:05] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:50:05] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:50:05] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:50:05] RECOVERY - restbase endpoints health on restbase2027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:50:05] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:50:06] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [20:50:06] RECOVERY - Apache HTTP on mw1381 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:50:07] RECOVERY - Apache HTTP on mw1358 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:50:07] RECOVERY - Apache HTTP on mw1345 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:50:08] RECOVERY - Apache HTTP on mw1382 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:50:08] RECOVERY - Apache HTTP on mw1378 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.044 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [20:50:09] RECOVERY - Apache HTTP on mw1377 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:50:09] RECOVERY - Apache HTTP on mw1359 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:50:10] RECOVERY - Apache HTTP on mw1360 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 1.017 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:50:10] RECOVERY - Apache HTTP on mw1347 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.291 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:50:11] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:50:11] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:50:12] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:50:12] RECOVERY - PHP7 rendering on mw1386 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:50:13] RECOVERY - Apache HTTP on mw1390 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:50:13] RECOVERY - PHP7 rendering on mw1344 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:50:14] RECOVERY - PHP7 rendering on mw1345 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.059 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:50:14] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [20:50:15] RECOVERY - Apache HTTP on mw1392 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:50:15] RECOVERY - PHP7 rendering on mw1343 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.271 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:50:16] RECOVERY - Apache HTTP on mw1449 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:50:16] RECOVERY - PHP7 rendering on mw1341 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:50:17] RECOVERY - PHP7 rendering on mw1312 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.374 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:50:17] RECOVERY - PHP7 rendering on mw1392 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:50:17] SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Marostegui) We just had another issue and db1132 (10.6) was the only one affected again. I will scan thru slow queries tomorrow EU...
[20:50:18] RECOVERY - PHP7 rendering on mw1388 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:50:18] RECOVERY - PHP7 rendering on mw1346 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:50:18] (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:50:19] RECOVERY - PHP7 rendering on mw1449 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:50:19] RECOVERY - PHP7 rendering on mw1379 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:50:20] RECOVERY - PHP7 rendering on mw1348 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.082 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:50:20] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [20:50:21] RECOVERY - PHP7 rendering on mw1376 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 1.427 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:50:21] RECOVERY - PHP7 rendering on mw1313 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:50:22] RECOVERY - PHP7 rendering on mw1426 is OK: HTTP OK: HTTP/1.1 302 Found - 559 
bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:50:22] RECOVERY - PHP7 rendering on mw1443 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.033 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:50:23] RECOVERY - PHP7 rendering on mw1424 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:50:23] RECOVERY - PHP7 rendering on mw1342 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:50:24] RECOVERY - PHP7 rendering on mw1396 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:50:24] RECOVERY - PHP7 rendering on mw1381 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:50:29] RECOVERY - Apache HTTP on mw1376 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 1.464 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:50:29] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:50:31] RECOVERY - PHP7 rendering on mw1315 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:50:35] RECOVERY - Apache HTTP on mw1380 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:50:35] RECOVERY - Apache HTTP on mw1342 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 1.066 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [20:50:35] RECOVERY - Apache HTTP on mw1341 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 1.098 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:50:37] RECOVERY - PHP7 rendering on mw1427 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.147 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:50:37] RECOVERY - Apache HTTP on mw1394 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:50:39] RECOVERY - Apache HTTP on mw1450 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:50:39] RECOVERY - Apache HTTP on mw1317 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:50:39] RECOVERY - Apache HTTP on mw1316 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:50:41] RECOVERY - Apache HTTP on mw1388 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:50:43] RECOVERY - Apache HTTP on mw1383 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:50:43] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:50:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:50:51] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:50:53] RECOVERY - PHP7 rendering on mw1316 is OK: HTTP OK: HTTP/1.1 
302 Found - 559 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:50:53] RECOVERY - PHP7 rendering on mw1317 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:50:53] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:50:55] RECOVERY - PHP7 rendering on mw1383 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:50:55] RECOVERY - PHP7 rendering on mw1377 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:50:55] RECOVERY - PHP7 rendering on mw1394 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:50:57] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [20:51:01] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:51:01] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:51:01] RECOVERY - Restbase edge drmrs on text-lb.drmrs.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [20:51:07] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [20:51:07] RECOVERY - termbox eqiad on 
termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [20:51:07] RECOVERY - PHP7 rendering on mw1382 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.034 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:51:09] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [20:51:09] RECOVERY - PHP7 rendering on mw1378 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:51:11] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:51:12] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: (2) Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [20:51:23] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:51:23] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:51:23] (PHPFPMTooBusy) resolved: Not enough idle php7.2-fpm.service workers for Mediawiki api_appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:51:24] 10SRE, 10DBA, 
10MediaWiki-Action-API, 10Traffic, and 2 others: API not responding (overflow) - https://phabricator.wikimedia.org/T313986 (10RhinosF1) [20:51:31] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:51:31] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [20:51:34] (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [20:51:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [20:51:37] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:51:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [20:52:19] 10SRE, 10DBA, 10MediaWiki-Action-API, 10Traffic, and 2 others: API not responding (overflow) - https://phabricator.wikimedia.org/T313986 (10RhinosF1) {T311106} [20:52:23] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [20:52:39] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:52:39] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:52:49] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:55:25] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [20:55:37] !log sukhe@cumin1001 dbctl commit (dc=all): 'depool db1111', diff saved to https://phabricator.wikimedia.org/P32018 and previous config saved to /var/cache/conftool/dbconfig/20220727-205536-sukhe.json [20:55:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:55:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: (2) Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [20:56:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:56:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:56:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [20:56:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:58:06] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - 
https://phabricator.wikimedia.org/T311106 (10Marostegui) db1132 and db1111 depooled [20:58:54] 10SRE, 10Data-Persistence (Consultation), 10MediaWiki-Action-API, 10Traffic, and 2 others: API not responding (overflow) - https://phabricator.wikimedia.org/T313986 (10Marostegui) Not sure what's expected from the DBAs here. There's a chain on things that got db1132 overloaded. [20:59:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [21:00:21] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [21:00:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:01:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [21:02:35] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [21:02:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [21:03:25] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] 
https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [21:04:12] 10SRE: Upload shiny-server .deb to our Buster apt repository - https://phabricator.wikimedia.org/T313989 (10mpopov) [21:05:36] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Python3-Porting: Migrate get-raid-status-megacli to Python3 - https://phabricator.wikimedia.org/T313952 (10Peachey88) [21:05:51] MatmaRex: all your patches should be up on mwdebug1002 -- can you verify? [21:06:20] sorry, my computer crashed, hope i didn't miss anything cjming [21:07:00] MatmaRex: I was just saying all your patches are on mwdebug1002 - can you check? [21:07:17] looking. thanks [21:09:21] cjming: looks good on wmf.21 [21:10:08] amf on wmf.22 too [21:10:12] everything looks fine [21:10:16] and* [21:12:40] cool - syncing them all [21:14:40] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): More public IPs for codfw1dev - https://phabricator.wikimedia.org/T313977 (10Andrew) 185.15.57.12/30 should be enough, so let's start with that. With luck that'll be all we need, and we can leave it as a permanent change. In t... 
[21:17:41] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.22/resources/src/jquery/jquery.textSelection.js: Backport: [[gerrit:817847|jquery.textSelection: Support more edge cases of document.execCommand (T33780)]] (duration: 03m 10s) [21:18:18] T33780: WikiEditor dialogs kill the undo buffer - https://phabricator.wikimedia.org/T33780 [21:21:12] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.21/resources/src/jquery/jquery.textSelection.js: Backport: [[gerrit:817848|jquery.textSelection: Use non-execCommand when we can't focus the field (T33780)]] (duration: 03m 09s) [21:25:02] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.22/resources/src/jquery/jquery.textSelection.js: Backport: [[gerrit:817849|jquery.textSelection: Use non-execCommand when we can't focus the field (T33780)]] (duration: 03m 22s) [21:25:07] T33780: WikiEditor dialogs kill the undo buffer - https://phabricator.wikimedia.org/T33780 [21:27:19] !log Removing reserved space on sessionstore storage volumes -- T313991 [21:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:23] T313991: Investigate sessionstore Cassandra utilization improvements - https://phabricator.wikimedia.org/T313991 [21:28:56] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.21/extensions/TemplateWizard/resources/ext.TemplateWizard.Dialog.js: Backport: [[gerrit:817850|Delay template insertion until after closing the dialog (T33780)]] (duration: 03m 27s) [21:29:29] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [21:29:32] (03PS1) 10Bearloga: shiny_server: Minimal dependencies [puppet] - 
10https://gerrit.wikimedia.org/r/817903 [21:31:55] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [21:32:47] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.22/extensions/TemplateWizard/resources/ext.TemplateWizard.Dialog.js: Backport: [[gerrit:817851|Delay template insertion until after closing the dialog (T33780)]] (duration: 03m 36s) [21:32:53] T33780: WikiEditor dialogs kill the undo buffer - https://phabricator.wikimedia.org/T33780 [21:33:08] (03CR) 10Brennen Bearnes: [C: 03+2] SearchTranslationsApi: Change the way we fetch TTM services [extensions/Translate] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/817855 (https://phabricator.wikimedia.org/T313836) (owner: 10Brennen Bearnes) [21:33:10] MatmaRex: all your changes should be live! [21:33:26] thanks! [21:33:26] brennan: all yours [21:33:43] !log end of UTC late backport window [21:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:53] cjming: thanks - going to sync the above Translate patch and train -> group1. 
[21:54:04] (03Merged) 10jenkins-bot: SearchTranslationsApi: Change the way we fetch TTM services [extensions/Translate] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/817855 (https://phabricator.wikimedia.org/T313836) (owner: 10Brennen Bearnes) [21:58:29] PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:59:03] !log brennen@deploy1002 Synchronized php-1.39.0-wmf.22/extensions/Translate/src/TtmServer: Backport: [[gerrit:817855|SearchTranslationsApi: Change the way we fetch TTM services (T313836)]] (duration: 03m 19s) [21:59:09] T313836: MediaWiki\Extension\Translate\TtmServer\ServiceCreationFailure: Unknown type for name 'Apertium': cxserver - https://phabricator.wikimedia.org/T313836 [21:59:13] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [22:00:40] (03PS1) 10TrainBranchBot: group1 wikis to 1.39.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817905 (https://phabricator.wikimedia.org/T308075) [22:00:41] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.39.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817905 (https://phabricator.wikimedia.org/T308075) (owner: 10TrainBranchBot) [22:00:57] RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:01:43] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817905 (https://phabricator.wikimedia.org/T308075) (owner: 10TrainBranchBot) [22:02:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START 
helmfile.d/services/mwdebug: apply [22:03:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:03:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:04:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:04:53] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [22:05:43] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.22 refs T308075 [22:05:47] T308075: 1.39.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T308075 [22:08:47] (03CR) 10Neil P. Quinn-WMF: "Our new entries look right! I just commented about a better place to put them in the existing organization scheme." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817263 (https://phabricator.wikimedia.org/T313633) (owner: 10Sbisson) [22:08:52] !log brennen@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.22 refs T308075 (duration: 03m 08s) [22:09:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:10:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:10:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:11:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:15:17] PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:15:47] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy 
https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:16:09] (03PS1) 10Bearloga: r_lang: Switch from devtools to remotes [puppet] - 10https://gerrit.wikimedia.org/r/817907 [22:26:49] ACKNOWLEDGEMENT - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project Andrew Bogott investigating! https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [22:26:49] ACKNOWLEDGEMENT - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, args nova-fullstack Andrew Bogott investigating! https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:32:15] RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:35:19] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 0 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [22:36:23] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Jclark-ctr) @cmooney curious if this is a switch config that needs to change these are racked in 10g racks but only use 1g ports db1190 E1 35... [22:37:21] (03PS1) 10RLazarus: requestctl: Add a missing f on an f-string [software/conftool] - 10https://gerrit.wikimedia.org/r/817910 [22:37:46] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Install NVMe SSDs into moss-be100[1|2] & thanos-be100? - https://phabricator.wikimedia.org/T310922 (10Jclark-ctr) moss-be1001 and moss-be1002 have been installed @LSobanski do you have a thanos-be100? 
host and can we schedule installation [22:49:55] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:54:01] (03CR) 10Dzahn: [V: 04-1] "did not find a value for the name 'profile::gerrit::migration::src_host" [puppet] - 10https://gerrit.wikimedia.org/r/817841 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [22:58:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [22:59:36] denisse|m: ^ that'll need a "sudo keyholder arm" on the new host [23:00:11] (03CR) 10Tim Starling: [C: 03+2] Increase core session expiry to 86400 to match CentralAuth [deployment-charts] - 10https://gerrit.wikimedia.org/r/816059 (https://phabricator.wikimedia.org/T313496) (owner: 10Tim Starling) [23:00:15] and then it will ask for a passphrase to load the key (https://wikitech.wikimedia.org/wiki/Keyholder#Production_passphrases) [23:02:30] mutante: Thanks for the heads-up!! I was about to go to have lunch. Do you think this change could wait for about an hour?? If not, I could do it right now. 🙈 [23:03:02] denisse|m: it can definitely wait an hour [23:03:18] go for lunch, it's late :) [23:03:42] Okay, cool. I'll go for lunch and do that once I'm back. Thanks. 
:) [23:04:36] (03Merged) 10jenkins-bot: Increase core session expiry to 86400 to match CentralAuth [deployment-charts] - 10https://gerrit.wikimedia.org/r/816059 (https://phabricator.wikimedia.org/T313496) (owner: 10Tim Starling) [23:05:55] (03PS4) 10Dzahn: gerrit: turn gerrit2002 into a gerrit migration dest host [puppet] - 10https://gerrit.wikimedia.org/r/817841 (https://phabricator.wikimedia.org/T313250) [23:08:03] !log tstarling@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply [23:08:13] !log tstarling@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [23:11:52] jouncebot: nowandnext [23:11:52] No deployments scheduled for the next 6 hour(s) and 48 minute(s) [23:11:52] In 6 hour(s) and 48 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220728T0600) [23:12:33] starting a decom that'll involve a scap proxy, any unscheduled deploys let me know so we can avoid a race condition :) [23:13:45] !log rzl@cumin2002 conftool action : set/pooled=no; selector: name=mw225[1-57-8].codfw.wmnet [23:13:51] !log tstarling@deploy1002 helmfile [codfw] START helmfile.d/services/sessionstore: apply [23:14:17] !log tstarling@deploy1002 helmfile [codfw] DONE helmfile.d/services/sessionstore: apply [23:14:41] !log tstarling@deploy1002 helmfile [eqiad] START helmfile.d/services/sessionstore: apply [23:14:58] !log tstarling@deploy1002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply [23:17:33] !log rzl@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on 7 hosts with reason: Decom [23:17:46] !log rzl@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 7 hosts with reason: Decom [23:18:47] !log rzl@cumin2002 conftool action : set/pooled=inactive; selector: name=mw225[1-57-8].codfw.wmnet [23:22:59] (03PS2) 10Tim Starling: Increase $wgObjectCacheSessionExpiry to 86400 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816060 
(https://phabricator.wikimedia.org/T313496) [23:23:07] (03CR) 10Tim Starling: [C: 03+2] Increase $wgObjectCacheSessionExpiry to 86400 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816060 (https://phabricator.wikimedia.org/T313496) (owner: 10Tim Starling) [23:24:35] (03Merged) 10jenkins-bot: Increase $wgObjectCacheSessionExpiry to 86400 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816060 (https://phabricator.wikimedia.org/T313496) (owner: 10Tim Starling) [23:26:06] !log rzl@cumin2002 START - Cookbook sre.hosts.decommission for hosts mw[2251-2255,2257-2258].codfw.wmnet [23:27:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [23:28:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [23:28:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [23:29:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [23:29:18] !log tstarling@deploy1002 Synchronized wmf-config/CommonSettings.php: increase wgObjectCacheSessionExpiry to 86400 (duration: 03m 30s) [23:34:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [23:35:07] (03PS2) 10Tim Starling: Move CentralAuth sessions to Kask [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816061 (https://phabricator.wikimedia.org/T313496) [23:35:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [23:35:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [23:36:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [23:36:49] (03CR) 10Tim Starling: [C: 03+2] Move CentralAuth sessions to Kask [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816061 (https://phabricator.wikimedia.org/T313496) (owner: 10Tim Starling) [23:37:45] (03Merged) 10jenkins-bot: Move CentralAuth sessions to Kask 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/816061 (https://phabricator.wikimedia.org/T313496) (owner: 10Tim Starling) [23:38:23] !log rzl@cumin2002 START - Cookbook sre.dns.netbox [23:41:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [23:42:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [23:42:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [23:43:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [23:43:58] (03PS1) 10Dduvall: scap: Deploy configuration using scap3 templates [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/817914 (https://phabricator.wikimedia.org/T313950) [23:45:11] !log rzl@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:45:12] !log rzl@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw[2251-2255,2257-2258].codfw.wmnet [23:45:51] !log tstarling@deploy1002 Synchronized wmf-config/InitialiseSettings.php: move CentralAuth sessions to Kask T313496 (duration: 05m 34s) [23:45:55] T313496: Harmonize CentralAuth and core session TTL and migrate CentralAuth sessions to Kask - https://phabricator.wikimedia.org/T313496 [23:46:54] (03PS2) 10RLazarus: Decom mw2251-2255,2257,2258 [puppet] - 10https://gerrit.wikimedia.org/r/817869 (https://phabricator.wikimedia.org/T313730) [23:47:01] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:47:11] (03Abandoned) 10Dduvall: scap: Deploy configuration using scap3 templates [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/817914 (https://phabricator.wikimedia.org/T313950) (owner: 10Dduvall) [23:47:15] (03PS1) 10Dduvall: scap: Deploy configuration using scap3 templates [phabricator/deployment] (wmf/stable) - 
10https://gerrit.wikimedia.org/r/817915 (https://phabricator.wikimedia.org/T313950) [23:48:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [23:48:56] (03CR) 10RLazarus: [C: 03+2] Decom mw2251-2255,2257,2258 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/817869 (https://phabricator.wikimedia.org/T313730) (owner: 10RLazarus) [23:49:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [23:49:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [23:49:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [23:59:08] !log tstarling@deploy1002 Synchronized wmf-config/InitialiseSettings.php: sync again now that scap proxy list is fixed T313730 T313496 (duration: 03m 25s) [23:59:15] T313496: Harmonize CentralAuth and core session TTL and migrate CentralAuth sessions to Kask - https://phabricator.wikimedia.org/T313496 [23:59:15] T313730: decommission mw2251-mw2255, mw2257-mw2258 - https://phabricator.wikimedia.org/T313730