[00:00:47] SRE, SRE-OnFire, Shellbox, serviceops, Sustainability (Incident Followup): Shellbox resource management - https://phabricator.wikimedia.org/T310557 (RLazarus) With https://gerrit.wikimedia.org/r/813924 we ought to see smaller bursts in utilization, so I'm going to tentatively crank the shellb...
[00:01:34] (PS1) RLazarus: shellbox: Restore replicas to 8, now that T312319 is resolved. [deployment-charts] - https://gerrit.wikimedia.org/r/816873 (https://phabricator.wikimedia.org/T310557)
[00:11:06] !log restarted php7.2-fpm on the 9 canary hosts in eqiad T313770
[00:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:14] T313770: Some traffic seems to be reaching 1.39.0-wmf.19 code - https://phabricator.wikimedia.org/T313770
[00:26:13] SRE, ops-codfw: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (Papaul)
[00:36:43] SRE, Security-Team, Security: Host crossdomain.xml master policy file - https://phabricator.wikimedia.org/T75574 (tstarling) Open→Declined
[01:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220726T0100)
[01:19:08] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:37:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] (JobUnavailable)
firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:47:45] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:52:45] (JobUnavailable) resolved: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:05:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:07:45] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.39.0-wmf.22 [core] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/816880 [02:07:51] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.39.0-wmf.22 [core] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/816880 (owner: 10TrainBranchBot) [02:08:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:08:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:09:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:23:58] (03Merged) 10jenkins-bot: Branch commit for wmf/1.39.0-wmf.22 [core] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/816880 (owner: 10TrainBranchBot) [02:28:10] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:29:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply 
[02:30:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:30:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:30:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:54:36] PROBLEM - Disk space on aqs1004 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra-a 108310 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=aqs1004&var-datasource=eqiad+prometheus/ops [04:11:36] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:26:28] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:43:30] (03PS1) 10Tim Starling: Set cache types for OAuth multi-DC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816884 (https://phabricator.wikimedia.org/T313578) [04:44:56] (03PS2) 10Tim Starling: Set cache types for OAuth multi-DC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816884 (https://phabricator.wikimedia.org/T313578) [04:46:08] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:46:41] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Marostegui) [04:46:58] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (10Marostegui) [04:47:29] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) 
rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (Marostegui)
[04:52:54] (PS1) Abijeet Patro: TranslationStashActionApi: Fix incorrect constructor dependencies [extensions/Translate] (wmf/1.39.0-wmf.22) - https://gerrit.wikimedia.org/r/816778 (https://phabricator.wikimedia.org/T312008)
[05:10:12] (PS1) Tim Starling: Multi-DC routing special cases for OAuth [puppet] - https://gerrit.wikimedia.org/r/817086 (https://phabricator.wikimedia.org/T313578)
[05:13:59] SRE, Traffic-Icebox, WMF-General-or-Unknown: Varnish: Mobile site redirect interferes with OAuth authorization process - https://phabricator.wikimedia.org/T74186 (tstarling) Note that I'm baking this bug into the ATS config in [[https://gerrit.wikimedia.org/r/c/operations/puppet/+/817086|gerrit 81708...
[05:43:47] SRE, Infrastructure-Foundations, netbox, Patch-For-Review: Grant cn=nda some sort of read only access to Netbox - https://phabricator.wikimedia.org/T302870 (ayounsi) After talking to Willy I also granted `dcim | device`, the reasoning is that it would take considerable efforts to be able to pull...
[05:49:04] (PS1) Ayounsi: Netbox: Allow login from NDA users [puppet] - https://gerrit.wikimedia.org/r/817087 (https://phabricator.wikimedia.org/T302870)
[05:49:56] (CR) CI reject: [V: -1] Netbox: Allow login from NDA users [puppet] - https://gerrit.wikimedia.org/r/817087 (https://phabricator.wikimedia.org/T302870) (owner: Ayounsi)
[05:50:20] (PS4) Giuseppe Lavagetto: mediawiki: allow forcing the backend for blank page on wikipedia [puppet] - https://gerrit.wikimedia.org/r/810312 (https://phabricator.wikimedia.org/T311386)
[05:51:25] (PS1) Ayounsi: sretest: set correct partman [puppet] - https://gerrit.wikimedia.org/r/817088
[05:52:15] (PS2) Ayounsi: Netbox: Allow login from NDA users [puppet] - https://gerrit.wikimedia.org/r/817087 (https://phabricator.wikimedia.org/T302870)
[05:52:26] (PS3) Ayounsi: Netbox: Allow login from NDA users [puppet] - https://gerrit.wikimedia.org/r/817087 (https://phabricator.wikimedia.org/T302870)
[05:52:57] (CR) Giuseppe Lavagetto: [C: +2] mediawiki: allow forcing the backend for blank page on wikipedia [puppet] - https://gerrit.wikimedia.org/r/810312 (https://phabricator.wikimedia.org/T311386) (owner: Giuseppe Lavagetto)
[05:53:51] (CR) Ayounsi: [C: +2] sretest: set correct partman [puppet] - https://gerrit.wikimedia.org/r/817088 (owner: Ayounsi)
[05:55:01] (PS1) KartikMistry: Update cxserver to 2022-07-25-080850-production [deployment-charts] - https://gerrit.wikimedia.org/r/817089 (https://phabricator.wikimedia.org/T309577)
[06:00:05] kormat, marostegui, and Amir1: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220726T0600).
[06:07:44] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye
[06:09:36] (PS1) Giuseppe Lavagetto: mediawiki: properly quote rewritecond [puppet] - https://gerrit.wikimedia.org/r/817090
[06:09:51] (PS2) Giuseppe Lavagetto: mediawiki: properly quote rewritecond [puppet] - https://gerrit.wikimedia.org/r/817090
[06:09:56] (CR) Giuseppe Lavagetto: [V: +2 C: +2] mediawiki: properly quote rewritecond [puppet] - https://gerrit.wikimedia.org/r/817090 (owner: Giuseppe Lavagetto)
[06:11:57] (PS2) Ayounsi: Add Python 3.10 support [software/spicerack] - https://gerrit.wikimedia.org/r/816702
[06:15:28] (PS1) Giuseppe Lavagetto: mediawiki: brown paper bag fix [puppet] - https://gerrit.wikimedia.org/r/817092
[06:15:44] (PS2) Giuseppe Lavagetto: mediawiki: brown paper bag fix [puppet] - https://gerrit.wikimedia.org/r/817092
[06:15:51] (CR) Giuseppe Lavagetto: [V: +2 C: +2] mediawiki: brown paper bag fix [puppet] - https://gerrit.wikimedia.org/r/817092 (owner: Giuseppe Lavagetto)
[06:21:53] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1001.eqiad.wmnet with OS bullseye
[06:24:08] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye
[06:29:10] (PS4) Jcrespo: bacula: Setup backup[12]00[89] as new production and database backup hosts [puppet] - https://gerrit.wikimedia.org/r/816120 (https://phabricator.wikimedia.org/T313582)
[06:29:12] (PS1) Jcrespo: db_inventory: Cleanup zarcillo database grants [puppet] - https://gerrit.wikimedia.org/r/817181 (https://phabricator.wikimedia.org/T146149)
[06:29:23] (PS2) Jcrespo: db_inventory: Cleanup zarcillo database grants [puppet] - https://gerrit.wikimedia.org/r/817181 (https://phabricator.wikimedia.org/T146149)
[06:31:10] (CR) Ayounsi: geodns: Map out African countries by DC latency (1 comment) [dns] -
https://gerrit.wikimedia.org/r/816028 (https://phabricator.wikimedia.org/T311472) (owner: BCornwall)
[06:33:14] (CR) CI reject: [V: -1] db_inventory: Cleanup zarcillo database grants [puppet] - https://gerrit.wikimedia.org/r/817181 (https://phabricator.wikimedia.org/T146149) (owner: Jcrespo)
[06:33:33] (CR) CI reject: [V: -1] db_inventory: Cleanup zarcillo database grants [puppet] - https://gerrit.wikimedia.org/r/817181 (https://phabricator.wikimedia.org/T146149) (owner: Jcrespo)
[06:34:02] (CR) Jcrespo: "This is my proposal for db_inventory (zarcillo + orchestrator) after a grant cleanup." [puppet] - https://gerrit.wikimedia.org/r/817181 (https://phabricator.wikimedia.org/T146149) (owner: Jcrespo)
[06:34:22] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:35:49] (PS3) Jcrespo: db_inventory: Cleanup zarcillo database grants [puppet] - https://gerrit.wikimedia.org/r/817181 (https://phabricator.wikimedia.org/T146149)
[06:35:59] (PS4) Jcrespo: db_inventory: Cleanup zarcillo database grants [puppet] - https://gerrit.wikimedia.org/r/817181 (https://phabricator.wikimedia.org/T146149)
[06:36:46] !log ayounsi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
[06:40:22] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
[06:54:25] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS bullseye
[06:58:11] <_joe_> !log upgrade all of codfw to python3-poolcounter 0.0.3 T310835
[06:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:16] T310835: Scap pool counter error while backporting - https://phabricator.wikimedia.org/T310835
[06:59:57] ACKNOWLEDGEMENT -
Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: ayounsi https://phabricator.wikimedia.org/T313783 - The acknowledgement expires at: 2022-07-27 06:59:31. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:59:57] ACKNOWLEDGEMENT - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: ayounsi https://phabricator.wikimedia.org/T313783 - The acknowledgement expires at: 2022-07-27 06:59:31. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:00:05] Amir1 and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220726T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:00:28] Window cancelled anyway [07:00:39] https://phabricator.wikimedia.org/T313770 [07:30:22] <_joe_> !log running a restart-all for php-fpm on appservers in codfw to test python-poolcounter 0.0.3 T310835 [07:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:28] T310835: Scap pool counter error while backporting - https://phabricator.wikimedia.org/T310835 [07:35:46] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:41:33] !log rolling restart of ats-be on cp[1080,1083,1085,1087,5006,6001,6006,6009,6011,6015] [07:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31902 and previous config saved to /var/cache/conftool/dbconfig/20220726-074717-root.json [07:47:22] !log marostegui@cumin1001 
dbctl commit (dc=all): 'db1111 (re)pooling @ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31903 and previous config saved to /var/cache/conftool/dbconfig/20220726-074721-root.json [07:47:26] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) I have enabled slow query log to get queries slower than 30 seconds on db1132 (s1) and db1111 (s8) and I am going to re... [07:48:27] <_joe_> !log deploy python3-poolcounter everywhere T310835 [07:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:33] T310835: Scap pool counter error while backporting - https://phabricator.wikimedia.org/T310835 [07:56:13] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/814848 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [07:57:44] (03PS1) 10Marostegui: db2088: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/817187 (https://phabricator.wikimedia.org/T311493) [08:00:19] (03CR) 10Marostegui: [C: 03+2] db2088: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/817187 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [08:02:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31906 and previous config saved to /var/cache/conftool/dbconfig/20220726-080221-root.json [08:02:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31907 and previous config saved to /var/cache/conftool/dbconfig/20220726-080225-root.json [08:05:09] 10SRE, 10SRE-Access-Requests: Requesting access to maintenance servers for mfossati - https://phabricator.wikimedia.org/T313706 (10Volans) @mfossati I just noticed that you 
already have shell access, as it was granted in T299343. Please mention it in subsequent requests for group membership changes as the form... [08:05:28] (03PS4) 10Giuseppe Lavagetto: lvs: check php 7.4 too on all appservers [puppet] - 10https://gerrit.wikimedia.org/r/810313 (https://phabricator.wikimedia.org/T311386) [08:07:38] (03PS1) 10Volans: admin: add mfossati to restricted group [puppet] - 10https://gerrit.wikimedia.org/r/817191 (https://phabricator.wikimedia.org/T313706) [08:08:39] (03PS1) 10Marostegui: mariadb: Productionize db2170 [puppet] - 10https://gerrit.wikimedia.org/r/817192 (https://phabricator.wikimedia.org/T311493) [08:09:25] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2170 [puppet] - 10https://gerrit.wikimedia.org/r/817192 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [08:10:36] (03PS5) 10Giuseppe Lavagetto: lvs: check php 7.4 too on all appservers [puppet] - 10https://gerrit.wikimedia.org/r/810313 (https://phabricator.wikimedia.org/T311386) [08:12:15] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10fgiunchedi) >>! In T211661#8102237, @ori wrote: > I see! Do we need to employ any of these strategies, then? What (if anything) s... [08:13:33] (03PS1) 10Marostegui: ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817194 (https://phabricator.wikimedia.org/T313401) [08:14:20] (03PS1) 10Marostegui: pc1014: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/817195 (https://phabricator.wikimedia.org/T313401) [08:14:54] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thank you !" 
[puppet] - 10https://gerrit.wikimedia.org/r/807176 (owner: 10Dzahn) [08:15:20] (03CR) 10Marostegui: [C: 03+2] pc1014: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/817195 (https://phabricator.wikimedia.org/T313401) (owner: 10Marostegui) [08:17:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 10%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31908 and previous config saved to /var/cache/conftool/dbconfig/20220726-081725-root.json [08:17:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 10%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31909 and previous config saved to /var/cache/conftool/dbconfig/20220726-081729-root.json [08:18:07] (03CR) 10MVernon: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817194 (https://phabricator.wikimedia.org/T313401) (owner: 10Marostegui) [08:19:32] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [08:19:47] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [08:30:57] (03CR) 10Marostegui: [C: 03+2] ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817194 (https://phabricator.wikimedia.org/T313401) (owner: 10Marostegui) [08:30:57] !log kevinbazira@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [08:30:57] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [08:30:57] (03Merged) 10jenkins-bot: ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817194 (https://phabricator.wikimedia.org/T313401) (owner: 10Marostegui) [08:30:58] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
[08:30:58] (03CR) 10Alexandros Kosiaris: [C: 03+2] configcluster: Turn-off zookeeper version pin [puppet] - 10https://gerrit.wikimedia.org/r/813233 (owner: 10Alexandros Kosiaris) [08:30:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:31:04] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Tested locally on a mw node, seems to work as expected." [puppet] - 10https://gerrit.wikimedia.org/r/816810 (owner: 10Alexandros Kosiaris) [08:31:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:31:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:31:41] !log marostegui@deploy1002 Synchronized wmf-config/ProductionServices.php: Promote pc1014 to pc3 master (duration: 03m 13s) [08:31:55] !log Promote pc1014 to pc3 master T313401 [08:32:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 50%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31911 and previous config saved to /var/cache/conftool/dbconfig/20220726-083229-root.json [08:32:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:32:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 50%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31912 and previous config saved to /var/cache/conftool/dbconfig/20220726-083233-root.json [08:33:29] did something happen to stashbot? 
I don’t see a reply to marostegui’s log, nor do I see it in the SAL [08:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:55] T313401: Move pc1014 from pc2 to pc3 - https://phabricator.wikimedia.org/T313401 [08:34:00] ah, there we go [08:34:04] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:37:34] RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:40:29] !log ayounsi@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1020 [08:40:56] RECOVERY - Host ganeti1020 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [08:41:00] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1020 [08:41:34] RECOVERY - Host cuminunpriv1001 is UP: PING OK - Packet loss = 0%, RTA = 1.84 ms [08:43:28] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10Marostegui) [08:44:24] (03PS1) 10Elukey: prometheus: add config for the k8s ml-staging codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/817201 (https://phabricator.wikimedia.org/T272918) [08:45:27] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10Marostegui) pc1013 is no longer a master in pc3. All the tasks owned by #dba team have been completed. Though, we'd appreciate a heads up before the maintenace t... 
[08:45:40] (03CR) 10Klausman: [C: 03+1] prometheus: add config for the k8s ml-staging codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/817201 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey) [08:47:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 75%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31913 and previous config saved to /var/cache/conftool/dbconfig/20220726-084733-root.json [08:47:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 75%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31914 and previous config saved to /var/cache/conftool/dbconfig/20220726-084737-root.json [08:50:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [08:50:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [08:55:54] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:57:37] (03PS5) 10Ayounsi: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 [09:02:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 100%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31915 and previous config saved to /var/cache/conftool/dbconfig/20220726-090237-root.json [09:02:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 100%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31916 and previous config saved to /var/cache/conftool/dbconfig/20220726-090241-root.json [09:07:03] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission frdb1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T313607 (10Volans) @Jgreen All this for us is managed by the `sre.hosts.decommission` cookbook, 
that we can't run for your hosts. I think you should check with @wiki_willy and #dc-ops for... [09:07:52] (03CR) 10CI reject: [V: 04-1] PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [09:13:21] !log manually restarting php on MW canaries: cumin 'A:mw-canary' 'restart-php-fpm-all' [09:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:41] (03PS11) 10Jelto: gitlab_runner: Handle changes to runner config [puppet] - 10https://gerrit.wikimedia.org/r/815769 (https://phabricator.wikimedia.org/T311746) (owner: 10Dduvall) [09:21:03] !log jnuche@deploy1002 Installing scap version "4.11.3" for 1 hosts [09:21:04] !log jnuche@deploy1002 Installation of scap version "4.11.3" completed for 1 hosts [09:21:11] (03PS1) 10Giuseppe Lavagetto: scap: restart php on 'scap pull' [puppet] - 10https://gerrit.wikimedia.org/r/817206 (https://phabricator.wikimedia.org/T313770) [09:21:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [09:22:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [09:22:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T312990)', diff saved to https://phabricator.wikimedia.org/P31917 and previous config saved to /var/cache/conftool/dbconfig/20220726-092217-marostegui.json [09:22:26] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [09:22:40] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/816702 (owner: 10Ayounsi) [09:23:47] (03CR) 10Volans: [C: 03+1] "LGTM, no concerns about it." 
[puppet] - https://gerrit.wikimedia.org/r/816824 (owner: Ayounsi)
[09:24:51] (CR) Marostegui: db_inventory: Cleanup zarcillo database grants (2 comments) [puppet] - https://gerrit.wikimedia.org/r/817181 (https://phabricator.wikimedia.org/T146149) (owner: Jcrespo)
[09:25:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T312990)', diff saved to https://phabricator.wikimedia.org/P31918 and previous config saved to /var/cache/conftool/dbconfig/20220726-092555-marostegui.json
[09:26:15] (CR) Vgutierrez: [C: +1] aptrepo: add a component for ATS 9.x [puppet] - https://gerrit.wikimedia.org/r/816801 (https://phabricator.wikimedia.org/T309651) (owner: Ssingh)
[09:27:05] (CR) Jaime Nuche: scap: restart php on 'scap pull' (1 comment) [puppet] - https://gerrit.wikimedia.org/r/817206 (https://phabricator.wikimedia.org/T313770) (owner: Giuseppe Lavagetto)
[09:28:44] (CR) Jcrespo: "Fixing..." [puppet] - https://gerrit.wikimedia.org/r/817181 (https://phabricator.wikimedia.org/T146149) (owner: Jcrespo)
[09:28:53] (CR) Giuseppe Lavagetto: scap: restart php on 'scap pull' (1 comment) [puppet] - https://gerrit.wikimedia.org/r/817206 (https://phabricator.wikimedia.org/T313770) (owner: Giuseppe Lavagetto)
[09:29:42] (CR) Giuseppe Lavagetto: [C: +2] scap: restart php on 'scap pull' [puppet] - https://gerrit.wikimedia.org/r/817206 (https://phabricator.wikimedia.org/T313770) (owner: Giuseppe Lavagetto)
[09:30:08] (CR) Jaime Nuche: [C: +1] scap: restart php on 'scap pull' [puppet] - https://gerrit.wikimedia.org/r/817206 (https://phabricator.wikimedia.org/T313770) (owner: Giuseppe Lavagetto)
[09:30:35] (PS1) Marostegui: mariadb: Decommission db2085 [puppet] - https://gerrit.wikimedia.org/r/817207 (https://phabricator.wikimedia.org/T313239)
[09:31:33] <_joe_> !log running puppet on the mw-canary hosts T313770
[09:31:36] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:37] T313770: scap no longer restarts php-fpm on canary servers - https://phabricator.wikimedia.org/T313770
[09:32:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2085.codfw.wmnet
[09:35:55] (CR) Marostegui: [C: +2] mariadb: Decommission db2085 [puppet] - https://gerrit.wikimedia.org/r/817207 (https://phabricator.wikimedia.org/T313239) (owner: Marostegui)
[09:36:35] (PS5) Jcrespo: db_inventory: Cleanup zarcillo database grants [puppet] - https://gerrit.wikimedia.org/r/817181 (https://phabricator.wikimedia.org/T146149)
[09:36:44] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/817087 (https://phabricator.wikimedia.org/T302870) (owner: Ayounsi)
[09:36:49] (PS6) Jcrespo: db_inventory: Cleanup zarcillo database grants [puppet] - https://gerrit.wikimedia.org/r/817181 (https://phabricator.wikimedia.org/T146149)
[09:36:54] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox
[09:37:17] (CR) Jcrespo: db_inventory: Cleanup zarcillo database grants (2 comments) [puppet] - https://gerrit.wikimedia.org/r/817181 (https://phabricator.wikimedia.org/T146149) (owner: Jcrespo)
[09:38:17] (CR) Jcrespo: "I will wait for Amir to be around and coordinate with both of you before deleting any grant."
[puppet] - https://gerrit.wikimedia.org/r/817181 (https://phabricator.wikimedia.org/T146149) (owner: Jcrespo)
[09:39:22] (CR) Marostegui: "Let's wait for him indeed" [puppet] - https://gerrit.wikimedia.org/r/817181 (https://phabricator.wikimedia.org/T146149) (owner: Jcrespo)
[09:40:47] !log oblivian@deploy1002 Synchronized README: testing fix for php restarts (duration: 02m 54s)
[09:40:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:40:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2085.codfw.wmnet
[09:40:59] ops-codfw, decommission-hardware: decommission db2085 - https://phabricator.wikimedia.org/T313239 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2085.codfw.wmnet` - db2085.codfw.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanager - Found phys...
[09:41:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P31920 and previous config saved to /var/cache/conftool/dbconfig/20220726-094100-marostegui.json
[09:41:02] ops-codfw, decommission-hardware: decommission db2085 - https://phabricator.wikimedia.org/T313239 (Marostegui) a:Papaul
[09:41:08] ops-codfw, decommission-hardware: decommission db2085 - https://phabricator.wikimedia.org/T313239 (Marostegui) Ready for you Papaul!
[09:41:15] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to maintenance servers for mfossati - https://phabricator.wikimedia.org/T313706 (mfossati) >>! In T313706#8103781, @Volans wrote: > @mfossati I just noticed that you already have shell access, as it was granted in T299343. Please mention i...
[09:41:16] (PS1) MVernon: swift: drain older systems, bring some new ones online [puppet] - https://gerrit.wikimedia.org/r/817209 (https://phabricator.wikimedia.org/T294549)
[09:42:11] (PS1) Klausman: ores: Add additional local JSON logger to uwsgi [puppet] - https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550)
[09:42:49] (CR) CI reject: [V: -1] ores: Add additional local JSON logger to uwsgi [puppet] - https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) (owner: Klausman)
[09:43:45] (PS1) Jelto: aptrepo: update gitlab-ce & gitlab-runner to 15.0 [puppet] - https://gerrit.wikimedia.org/r/817211 (https://phabricator.wikimedia.org/T309062)
[09:44:38] (CR) MVernon: swift: stop flinging thumbnails at other DC in rewrite.py (1 comment) [puppet] - https://gerrit.wikimedia.org/r/816726 (https://phabricator.wikimedia.org/T313102) (owner: MVernon)
[09:45:10] SRE, Patch-For-Review: uwsgi socket/UDP logger is broken if no other logger uses the same format - https://phabricator.wikimedia.org/T312550 (elukey) Important note: we moved from 2.0.14+20161117-3+deb9u2+wmf1 (custom version on wikimedia-stretch) to 2.0.18-1 (upstream version on Debian Buster).
[09:45:25] (CR) MVernon: tlsproxy: manage ssl_ecdhe_curve internally (1 comment) [puppet] - https://gerrit.wikimedia.org/r/789559 (https://phabricator.wikimedia.org/T307510) (owner: Gehel)
[09:49:25] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:51:41] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.258 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:56:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P31921 and previous config saved to /var/cache/conftool/dbconfig/20220726-095605-marostegui.json
[10:00:46] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:07:05] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36382/console" [puppet] - https://gerrit.wikimedia.org/r/816850 (owner: Jbond)
[10:07:17] (CR) Ayounsi: [C: +2] Netbox: Allow login from NDA users [puppet] - https://gerrit.wikimedia.org/r/817087 (https://phabricator.wikimedia.org/T302870) (owner: Ayounsi)
[10:10:26] (CR) Jbond: [V: +1 C: +2] C:ssh::client: Handle case where aliases not set [puppet] - https://gerrit.wikimedia.org/r/816850 (owner: Jbond)
[10:11:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T312990)', diff saved to https://phabricator.wikimedia.org/P31922 and previous config saved to /var/cache/conftool/dbconfig/20220726-101110-marostegui.json
[10:11:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [10:11:15] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [10:11:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [10:11:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T312990)', diff saved to https://phabricator.wikimedia.org/P31923 and previous config saved to /var/cache/conftool/dbconfig/20220726-101130-marostegui.json [10:12:46] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36384/console" [puppet] - 10https://gerrit.wikimedia.org/r/817207 (https://phabricator.wikimedia.org/T313239) (owner: 10Marostegui) [10:13:34] (03PS2) 10Klausman: ores: Add additional local JSON logger to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) [10:13:39] 10SRE, 10Infrastructure-Foundations, 10netbox, 10Patch-For-Review: Grant cn=nda some sort of read only access to Netbox - https://phabricator.wikimedia.org/T302870 (10ayounsi) a:03ayounsi @taavi let me know if that works as expected on netbox.wikimedia.org and feel free to close the task if so. 
[10:14:02] (03CR) 10Ayounsi: [C: 03+2] Add Python 3.10 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/816702 (owner: 10Ayounsi) [10:14:15] (03CR) 10CI reject: [V: 04-1] ores: Add additional local JSON logger to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) (owner: 10Klausman) [10:14:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T312990)', diff saved to https://phabricator.wikimedia.org/P31924 and previous config saved to /var/cache/conftool/dbconfig/20220726-101446-marostegui.json [10:15:47] 10SRE, 10Infrastructure-Foundations, 10netbox, 10Patch-For-Review: Grant cn=nda some sort of read only access to Netbox - https://phabricator.wikimedia.org/T302870 (10taavi) 05Open→03Resolved looks good, thank you! [10:17:07] (03PS3) 10Klausman: ores: Add additional local JSON logger to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) [10:17:37] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36385/console" [puppet] - 10https://gerrit.wikimedia.org/r/817207 (https://phabricator.wikimedia.org/T313239) (owner: 10Marostegui) [10:17:56] (03CR) 10CI reject: [V: 04-1] ores: Add additional local JSON logger to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) (owner: 10Klausman) [10:18:09] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36386/console" [puppet] - 10https://gerrit.wikimedia.org/r/817207 (https://phabricator.wikimedia.org/T313239) (owner: 10Marostegui) [10:21:28] (03Merged) 10jenkins-bot: Add Python 3.10 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/816702 (owner: 10Ayounsi) [10:22:31] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36387/console" [puppet] - 10https://gerrit.wikimedia.org/r/815769 (https://phabricator.wikimedia.org/T311746) (owner: 10Dduvall) [10:23:04] (03PS4) 10Klausman: ores: Add additional local JSON logger to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) [10:23:54] (03CR) 10CI reject: [V: 04-1] ores: Add additional local JSON logger to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) (owner: 10Klausman) [10:23:56] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36388/console" [puppet] - 10https://gerrit.wikimedia.org/r/817207 (https://phabricator.wikimedia.org/T313239) (owner: 10Marostegui) [10:24:49] (03PS2) 10Jbond: compiler_debug: fix output formating [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/816839 [10:25:01] (03CR) 10Jbond: [C: 03+2] compiler_debug: fix output formating [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/816839 (owner: 10Jbond) [10:26:32] (03Merged) 10jenkins-bot: compiler_debug: fix output formating [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/816839 (owner: 10Jbond) [10:28:13] (03PS6) 10Ayounsi: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 [10:29:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P31925 and previous config saved to /var/cache/conftool/dbconfig/20220726-102951-marostegui.json [10:30:03] (03CR) 10MVernon: "Hi," [alerts] - 10https://gerrit.wikimedia.org/r/812883 (https://phabricator.wikimedia.org/T312765) (owner: 10Mark Bergsma) [10:32:59] (03CR) 10Volans: [C: 03+2] scripts/hiera_export: add ganeti group [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/810955 (owner: 10Volans) 
[10:33:55] (03Merged) 10jenkins-bot: scripts/hiera_export: add ganeti group [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/810955 (owner: 10Volans) [10:34:18] (03PS5) 10Klausman: ores: Add additional local JSON logger to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) [10:34:28] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:34:46] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:34:54] !log akosiaris@cumin1001 START - Cookbook sre.dns.netbox [10:35:06] (03CR) 10CI reject: [V: 04-1] ores: Add additional local JSON logger to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) (owner: 10Klausman) [10:35:17] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36389/console" [puppet] - 10https://gerrit.wikimedia.org/r/817207 (https://phabricator.wikimedia.org/T313239) (owner: 10Marostegui) [10:35:20] (03PS1) 10Marostegui: db2089: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/817216 (https://phabricator.wikimedia.org/T311493) [10:35:50] (03CR) 10CI reject: [V: 04-1] PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [10:36:31] (03CR) 10Marostegui: [C: 03+2] db2089: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/817216 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [10:36:46] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48392 bytes in 7.851 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:36:58] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.272 second 
response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:38:18] (03CR) 10Volans: [C: 03+2] netbox::host: adapt to new Netbox data [puppet] - 10https://gerrit.wikimedia.org/r/810956 (owner: 10Volans) [10:38:30] !log volans@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Update hieradata from Netbox - volans@cumin2002" [10:39:18] !log volans@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Update hieradata from Netbox - volans@cumin2002" [10:40:30] (03PS6) 10Klausman: ores: Add additional local JSON logger to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) [10:41:12] (03CR) 10CI reject: [V: 04-1] ores: Add additional local JSON logger to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) (owner: 10Klausman) [10:41:25] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36390/console" [puppet] - 10https://gerrit.wikimedia.org/r/817207 (https://phabricator.wikimedia.org/T313239) (owner: 10Marostegui) [10:41:37] (03PS1) 10Marostegui: mariadb: Productionize db2171 [puppet] - 10https://gerrit.wikimedia.org/r/817218 (https://phabricator.wikimedia.org/T311493) [10:41:39] (03PS10) 10Jbond: P:ssh::client: use more modern functions for collecting sskey [puppet] - 10https://gerrit.wikimedia.org/r/816775 [10:41:41] (03PS10) 10Jbond: never merge, test doing the reduce in ruby [puppet] - 10https://gerrit.wikimedia.org/r/816852 [10:42:35] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2171 [puppet] - 10https://gerrit.wikimedia.org/r/817218 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [10:43:11] (03PS7) 10Klausman: ores: Add additional local JSON logger to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) [10:43:53] 
(03CR) 10CI reject: [V: 04-1] ores: Add additional local JSON logger to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) (owner: 10Klausman) [10:44:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P31928 and previous config saved to /var/cache/conftool/dbconfig/20220726-104456-marostegui.json [10:44:59] (03PS11) 10Jbond: never merge, test doing the reduce in ruby [puppet] - 10https://gerrit.wikimedia.org/r/816852 [10:48:30] (03PS8) 10Klausman: ores: Add additional local JSON logger to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) [10:48:55] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: add config for the k8s ml-staging codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/817201 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey) [10:50:01] (03CR) 10CI reject: [V: 04-1] ores: Add additional local JSON logger to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) (owner: 10Klausman) [10:51:52] (03CR) 10Filippo Giunchedi: [C: 03+1] swift: stop flinging thumbnails at other DC in rewrite.py (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/816726 (https://phabricator.wikimedia.org/T313102) (owner: 10MVernon) [10:56:27] (03PS9) 10Klausman: ores: Add additional local JSON logger to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) [10:57:09] (03CR) 10CI reject: [V: 04-1] ores: Add additional local JSON logger to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) (owner: 10Klausman) [10:59:50] (03PS10) 10Klausman: ores: Add additional local JSON logger to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) [11:00:02] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1100 (T312990)', diff saved to https://phabricator.wikimedia.org/P31929 and previous config saved to /var/cache/conftool/dbconfig/20220726-110002-marostegui.json [11:00:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [11:00:07] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [11:00:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [11:00:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T312990)', diff saved to https://phabricator.wikimedia.org/P31930 and previous config saved to /var/cache/conftool/dbconfig/20220726-110022-marostegui.json [11:02:07] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36391/console" [puppet] - 10https://gerrit.wikimedia.org/r/817207 (https://phabricator.wikimedia.org/T313239) (owner: 10Marostegui) [11:02:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T312990)', diff saved to https://phabricator.wikimedia.org/P31931 and previous config saved to /var/cache/conftool/dbconfig/20220726-110258-marostegui.json [11:03:56] (03CR) 10Volans: "recheck" [software/cumin] - 10https://gerrit.wikimedia.org/r/813201 (owner: 10Volans) [11:04:18] (03PS2) 10Volans: sre.hosts.provision: ask to setup the RAID [cookbooks] - 10https://gerrit.wikimedia.org/r/812448 [11:04:40] (03PS1) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817221 [11:05:49] (03PS11) 10Klausman: ores: Add additional local JSON logger to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) [11:07:33] (03CR) 10Filippo Giunchedi: "Idea LGTM, what's the estimated 
filesystem bytes usage after these moves are completed? I'm asking because I suspect without most/all of t" [puppet] - 10https://gerrit.wikimedia.org/r/817209 (https://phabricator.wikimedia.org/T294549) (owner: 10MVernon) [11:08:27] (03PS1) 10Pwangai: admin: Add pwangai to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/817223 (https://phabricator.wikimedia.org/T313794) [11:08:33] (03PS2) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817221 [11:08:57] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [11:10:11] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36394/console" [puppet] - 10https://gerrit.wikimedia.org/r/817207 (https://phabricator.wikimedia.org/T313239) (owner: 10Marostegui) [11:10:46] (03PS12) 10Klausman: ores: Add additional local JSON logger to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) [11:11:32] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36395/console" [puppet] - 10https://gerrit.wikimedia.org/r/817207 (https://phabricator.wikimedia.org/T313239) (owner: 10Marostegui) [11:12:16] (03PS13) 10Klausman: ores: Add additional local JSON logger to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) [11:12:52] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:13:08] (03PS3) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817221 [11:13:12] (03PS1) 10Jbond: O:prometheus: drop old absent file [puppet] - 10https://gerrit.wikimedia.org/r/817224 [11:13:44] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36397/console" [puppet] - 10https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) (owner: 10Klausman) [11:13:58] (03CR) 10Jbond: [C: 03+2] O:prometheus: drop old absent file [puppet] - 10https://gerrit.wikimedia.org/r/817224 (owner: 10Jbond) [11:14:28] (03CR) 10Jbond: [V: 03+2 C: 03+2] O:prometheus: drop old absent file [puppet] - 10https://gerrit.wikimedia.org/r/817224 (owner: 10Jbond) [11:16:27] (03PS14) 10Klausman: ores: Add additional local JSON logger to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) [11:17:14] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36399/console" [puppet] - 10https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) (owner: 10Klausman) [11:17:37] PROBLEM - Kafka broker TLS certificate validity on kafka-jumbo1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [11:18:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P31932 and previous config saved to /var/cache/conftool/dbconfig/20220726-111803-marostegui.json [11:18:33] (03CR) 10Ssingh: [C: 03+2] aptrepo: add a component for ATS 9.x [puppet] - 10https://gerrit.wikimedia.org/r/816801 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [11:19:29] RECOVERY - Kafka broker TLS certificate validity on kafka-jumbo1002 is OK: SSL OK - Certificate kafka_jumbo-eqiad_broker valid until 2022-12-04 14:47:46 +0000 (expires in 131 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [11:20:16] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36398/console" [puppet] - 10https://gerrit.wikimedia.org/r/817221 (owner: 10Jbond) [11:20:36] (03PS15) 10Klausman: ores: Add additional local JSON logger to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) [11:21:20] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36401/console" [puppet] - 10https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) (owner: 10Klausman) [11:21:33] (03PS1) 10Phuedx: testwiki: Enable mediawiki.web_ui.interactions stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817225 (https://phabricator.wikimedia.org/T311268) [11:21:54] (03PS16) 10Klausman: ores: Add additional local JSON logger to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) [11:22:19] (03CR) 10Jelto: [V: 03+1 C: 04-1] "looks mostly good! I left a small note in the service unit file." 
[puppet] - 10https://gerrit.wikimedia.org/r/815769 (https://phabricator.wikimedia.org/T311746) (owner: 10Dduvall) [11:23:00] (03CR) 10Klausman: "https://puppet-compiler.wmflabs.org/pcc-worker1003/36401/ Shows the expected results (including a functional noop for the puppetboard host" [puppet] - 10https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) (owner: 10Klausman) [11:24:05] (03CR) 10Volans: [C: 03+2] CI: fix reported issues [software/cumin] - 10https://gerrit.wikimedia.org/r/813201 (owner: 10Volans) [11:26:37] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [11:28:29] (03PS11) 10Jbond: P:ssh::client: use more modern functions for collecting sskey [puppet] - 10https://gerrit.wikimedia.org/r/816775 [11:31:42] (03Merged) 10jenkins-bot: CI: fix reported issues [software/cumin] - 10https://gerrit.wikimedia.org/r/813201 (owner: 10Volans) [11:33:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P31933 and previous config saved to /var/cache/conftool/dbconfig/20220726-113308-marostegui.json [11:48:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T312990)', diff saved to https://phabricator.wikimedia.org/P31934 and previous config saved to /var/cache/conftool/dbconfig/20220726-114813-marostegui.json [11:48:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [11:48:18] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [11:48:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [11:48:34] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Depooling db1144:3315 (T312990)', diff saved to https://phabricator.wikimedia.org/P31935 and previous config saved to /var/cache/conftool/dbconfig/20220726-114833-marostegui.json [11:49:50] (03PS4) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817221 [11:51:21] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:52:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T312990)', diff saved to https://phabricator.wikimedia.org/P31936 and previous config saved to /var/cache/conftool/dbconfig/20220726-115204-marostegui.json [11:54:31] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36402/console" [puppet] - 10https://gerrit.wikimedia.org/r/817221 (owner: 10Jbond) [11:54:37] (03PS1) 10Marostegui: site.pp: Remove insetup from db217[0-1] [puppet] - 10https://gerrit.wikimedia.org/r/817251 (https://phabricator.wikimedia.org/T311493) [11:55:31] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove insetup from db217[0-1] [puppet] - 10https://gerrit.wikimedia.org/r/817251 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [12:02:16] 10SRE, 10Data-Engineering, 10Event-Platform, 10serviceops: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (10akosiaris) @ottomata, has there been any progress on this one? Anything (e.g. reviews) we can help with? 
[12:02:27] !log oblivian@deploy1002 Synchronized README: testing fix for php restarts T313770 (duration: 03m 15s)
[12:02:31] T313770: scap no longer restarts php-fpm on canary servers - https://phabricator.wikimedia.org/T313770
[12:02:34] SRE, Data Engineering Planning, Data-Engineering-Kanban, Event-Platform, and 2 others: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (akosiaris) @ottomata, has there been any progress on this one? Anything (e.g. reviews) we can help with?
[12:05:23] (PS1) Ssingh: hiera: update key name for snake oil IP blocklist (Wikidough) [labs/private] - https://gerrit.wikimedia.org/r/817252
[12:05:54] (CR) Ssingh: [V: +2 C: +2] hiera: update key name for snake oil IP blocklist (Wikidough) [labs/private] - https://gerrit.wikimedia.org/r/817252 (owner: Ssingh)
[12:07:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P31937 and previous config saved to /var/cache/conftool/dbconfig/20220726-120709-marostegui.json
[12:16:59] (PS2) MVernon: swift: drain older systems, bring some new ones online [puppet] - https://gerrit.wikimedia.org/r/817209 (https://phabricator.wikimedia.org/T294549)
[12:18:55] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:19:05] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:22:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P31938 and previous config saved to /var/cache/conftool/dbconfig/20220726-122214-marostegui.json
[12:24:23] !log jnuche@deploy1002 Installing scap version "4.11.4" for 559 hosts [12:24:44] !log jnuche@deploy1002 Installation of scap version "4.11.4" completed for 559 hosts [12:25:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:25:35] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:29:25] RECOVERY - Check systemd state on wtp1040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:32:21] !log jnuche@deploy1002 Synchronized README: Verifying fix for T313770 (duration: 03m 14s) [12:32:25] T313770: scap no longer restarts php-fpm on canary servers - https://phabricator.wikimedia.org/T313770 [12:34:05] (03CR) 10Elukey: ores: Add additional local JSON logger to uwsgi (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) (owner: 10Klausman) [12:37:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T312990)', diff saved to https://phabricator.wikimedia.org/P31939 and previous 
config saved to /var/cache/conftool/dbconfig/20220726-123719-marostegui.json [12:37:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [12:37:26] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [12:37:34] (03CR) 10Nikerabbit: "$wgTranslateUseSandbox is false on WMF production, so this shouldn't be a problem there." [extensions/Translate] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/816778 (https://phabricator.wikimedia.org/T312008) (owner: 10Abijeet Patro) [12:37:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [12:37:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:37:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:37:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T312990)', diff saved to https://phabricator.wikimedia.org/P31940 and previous config saved to /var/cache/conftool/dbconfig/20220726-123745-marostegui.json [12:39:31] (03PS2) 10Alexandros Kosiaris: Add client side conf100[789] in DNS SRV records [dns] - 10https://gerrit.wikimedia.org/r/811885 (https://phabricator.wikimedia.org/T311407) [12:39:57] (03PS1) 10Bartosz Dziewoński: src/jquery: Move var declarations inline [core] (wmf/1.39.0-wmf.21) - 10https://gerrit.wikimedia.org/r/817231 [12:40:06] (03PS17) 10Klausman: ores: Add additional local JSON logger to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) [12:40:22] (03PS1) 10Bartosz Dziewoński: jquery.textSelection: Support more edge cases of document.execCommand [core] 
(wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/817232 (https://phabricator.wikimedia.org/T33780)
[12:41:07] jouncebot: next
[12:41:07] In 0 hour(s) and 18 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220726T1300)
[12:41:07] In 0 hour(s) and 18 minute(s): Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220726T1300)
[12:41:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T312990)', diff saved to https://phabricator.wikimedia.org/P31941 and previous config saved to /var/cache/conftool/dbconfig/20220726-124112-marostegui.json
[12:41:35] is the backport window happening? i saw that dire message yesterday, but it looks like the bug was just fixed (and i have a JS-only patch)
[12:41:58] (PS2) Bartosz Dziewoński: jquery.textSelection: Support more edge cases of document.execCommand [core] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/817232 (https://phabricator.wikimedia.org/T33780)
[12:43:12] (CR) Klausman: [V: +1] ores: Add additional local JSON logger to uwsgi (1 comment) [puppet] - https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) (owner: Klausman)
[12:43:14] (CR) Alexandros Kosiaris: [C: +2] Add client side conf100[789] in DNS SRV records [dns] - https://gerrit.wikimedia.org/r/811885 (https://phabricator.wikimedia.org/T311407) (owner: Alexandros Kosiaris)
[12:45:48] (CR) Elukey: ores: Add additional local JSON logger to uwsgi (1 comment) [puppet] - https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) (owner: Klausman)
[12:47:51] (PS18) Klausman: ores: Add additional local JSON logger to uwsgi [puppet] - https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550)
[12:49:27] (PS19) Klausman: ores: Add additional local JSON logger to uwsgi [puppet] -
https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550)
[12:50:14] (CR) Klausman: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36405/console" [puppet] - https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) (owner: Klausman)
[12:51:13] (CR) Alexandros Kosiaris: [C: +2] "Setup work done." [puppet] - https://gerrit.wikimedia.org/r/816149 (https://phabricator.wikimedia.org/T311407) (owner: Jcrespo)
[12:51:37] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:52:37] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:53:19] (CR) Klausman: ores: Add additional local JSON logger to uwsgi (1 comment) [puppet] - https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) (owner: Klausman)
[12:53:42] SRE, serviceops: Update conf1* servers - https://phabricator.wikimedia.org/T310062 (akosiaris)
[12:53:50] SRE, ops-eqiad, DC-Ops, serviceops: Q3:(Need By: TBD) rack/setup/install conf100[789] - https://phabricator.wikimedia.org/T301272 (akosiaris)
[12:53:55] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.292 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:54:55] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48390 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:56:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P31942 and previous config saved to /var/cache/conftool/dbconfig/20220726-125617-marostegui.json
[12:59:21] hi there 👋
[12:59:38]
https://phabricator.wikimedia.org/T313770 should be fixed now, the next backport window can go ahead
[13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220726T1300)
[13:00:05] MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220726T1300)
[13:00:15] (PS1) Alexandros Kosiaris: conf100[456]: Remove them from client DNS RRs [dns] - https://gerrit.wikimedia.org/r/817260 (https://phabricator.wikimedia.org/T311408)
[13:00:17] (PS1) Alexandros Kosiaris: conf100[456]: Remove them from server SRV RRs [dns] - https://gerrit.wikimedia.org/r/817261 (https://phabricator.wikimedia.org/T311408)
[13:00:19] jnuche: great, thanks!
[13:00:24] hi
[13:00:26] thanks jnuche
[13:00:43] I could deploy later in the window but right now I’m busy with something else, sorry
[13:00:52] (PS1) Jbond: CHANGELOG: add changelogs for release v3.1.1 [software/spicerack] - https://gerrit.wikimedia.org/r/817262
[13:00:57] (CR) FNegri: [C: +1] "LGTM, I tested it locally with a simple function and it works as expected!"
[puppet] - https://gerrit.wikimedia.org/r/816841 (owner: Andrew Bogott)
[13:01:27] (CR) Jbond: [C: +2] CHANGELOG: add changelogs for release v3.1.1 [software/spicerack] - https://gerrit.wikimedia.org/r/817262 (owner: Jbond)
[13:05:28] (PS1) Sbisson: Register Wikistories streams [mediawiki-config] - https://gerrit.wikimedia.org/r/817263 (https://phabricator.wikimedia.org/T313633)
[13:06:31] (PS1) Alexandros Kosiaris: Switch etcd clients to use conf100[789] [puppet] - https://gerrit.wikimedia.org/r/817264 (https://phabricator.wikimedia.org/T311408)
[13:06:33] (PS1) Alexandros Kosiaris: Switch zookeeper clients to use conf100[789] [puppet] - https://gerrit.wikimedia.org/r/817265 (https://phabricator.wikimedia.org/T311408)
[13:06:35] (PS1) Alexandros Kosiaris: Remove mentions of conf100[456] [puppet] - https://gerrit.wikimedia.org/r/817266 (https://phabricator.wikimedia.org/T311408)
[13:06:45] i'm afk for a minute, but nearby if anyone is able to deploy my backports
[13:08:33] (CR) Jbond: [V: +2 C: +2] CHANGELOG: add changelogs for release v3.1.1 [software/spicerack] - https://gerrit.wikimedia.org/r/817262 (owner: Jbond)
[13:09:13] MatmaRex: I can deploy for you if you're around
[13:10:00] taavi: yeah.
thanks
[13:10:09] (CR) Majavah: [C: +2] src/jquery: Move var declarations inline [core] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/817231 (owner: Bartosz Dziewoński)
[13:10:12] (CR) Majavah: [C: +2] jquery.textSelection: Support more edge cases of document.execCommand [core] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/817232 (https://phabricator.wikimedia.org/T33780) (owner: Bartosz Dziewoński)
[13:11:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P31943 and previous config saved to /var/cache/conftool/dbconfig/20220726-131122-marostegui.json
[13:14:14] (PS1) Jbond: Upstream release v3.1.1 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/817268
[13:14:32] (PS1) Kevin Bazira: ml-services: Add eu, hu & hy wiki articletopic isvcs to prod [deployment-charts] - https://gerrit.wikimedia.org/r/817269 (https://phabricator.wikimedia.org/T313307)
[13:14:35] (CR) Jbond: [V: +2 C: +2] Upstream release v3.1.1 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/817268 (owner: Jbond)
[13:18:14] (CR) Vgutierrez: [C: +1] "basic testing done in our WMCS environment, LGTM" [debs/trafficserver] - https://gerrit.wikimedia.org/r/807554 (owner: Ssingh)
[13:18:54] (CR) Ssingh: [C: +2] Release 9.1.2-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/807554 (owner: Ssingh)
[13:19:49] (PS20) Klausman: ores: Increase uwsqi request buffer size to 8192 [puppet] - https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550)
[13:20:47] (PS10) Aqu: Puppetize spark3 installation and configs using conda-analytics env [puppet] - https://gerrit.wikimedia.org/r/813278 (https://phabricator.wikimedia.org/T312882) (owner: Ottomata)
[13:20:55] (CR) Klausman: [V: +1] "PCC SUCCESS (NOOP 2 DIFF 1):
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36406/console" [puppet] - https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) (owner: Klausman)
[13:21:23] (PS1) Ssingh: dnsdist: add support for IP blocklist [puppet] - https://gerrit.wikimedia.org/r/817270
[13:21:56] (CR) Elukey: [C: +1] ores: Increase uwsqi request buffer size to 8192 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) (owner: Klausman)
[13:25:23] (PS21) Klausman: ores: Increase uwsgi request buffer size to 8192 [puppet] - https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550)
[13:25:24] !log uploaded spicerack_3.1.1 to apt.wikimedia.org bullseye-wikimedia
[13:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:40] (CR) Klausman: [C: +2] ores: Increase uwsgi request buffer size to 8192 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) (owner: Klausman)
[13:26:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T312990)', diff saved to https://phabricator.wikimedia.org/P31944 and previous config saved to /var/cache/conftool/dbconfig/20220726-132628-marostegui.json
[13:26:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[13:26:37] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[13:26:43] (PS22) Klausman: ores: Increase uwsgi request buffer size to 8192 [puppet] - https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550)
[13:26:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[13:26:50] !log marostegui@cumin1001 dbctl commit
(dc=all): 'Depooling db1110 (T312990)', diff saved to https://phabricator.wikimedia.org/P31945 and previous config saved to /var/cache/conftool/dbconfig/20220726-132650-marostegui.json
[13:27:07] (CR) Klausman: [V: +2] ores: Increase uwsgi request buffer size to 8192 [puppet] - https://gerrit.wikimedia.org/r/817210 (https://phabricator.wikimedia.org/T312550) (owner: Klausman)
[13:27:59] (CR) Aqu: [C: +1] "The new spark-env looks great." [puppet] - https://gerrit.wikimedia.org/r/813278 (https://phabricator.wikimedia.org/T312882) (owner: Ottomata)
[13:28:09] alright, I can deploy
[13:28:13] * Lucas_WMDE looks at MatmaRex’ change
[13:28:39] oh dear, backports
[13:28:42] those will take longer in CI
[13:28:44] (Merged) jenkins-bot: src/jquery: Move var declarations inline [core] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/817231 (owner: Bartosz Dziewoński)
[13:28:48] (Merged) jenkins-bot: jquery.textSelection: Support more edge cases of document.execCommand [core] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/817232 (https://phabricator.wikimedia.org/T33780) (owner: Bartosz Dziewoński)
[13:28:48] Lucas_WMDE: taavi is doing it :)
[13:28:51] ah
[13:29:09] I missed that message above, thanks!
[13:30:16] MatmaRex: pulled to mwdebug1001, can you test please?
[13:30:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T312990)', diff saved to https://phabricator.wikimedia.org/P31946 and previous config saved to /var/cache/conftool/dbconfig/20220726-133023-marostegui.json
[13:30:58] taavi: looking
[13:32:37] taavi: seems good
[13:32:45] great, syncing
[13:35:54] !log taavi@deploy1002 Synchronized php-1.39.0-wmf.21/resources/src/jquery/jquery.textSelection.js: backporting gerrit r817231 r817232 for wmf.21, T33780 (duration: 03m 02s)
[13:35:59] T33780: WikiEditor dialogs kill the undo buffer - https://phabricator.wikimedia.org/T33780
[13:36:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:36:11] all done, unless someone else has something to deploy
[13:36:50] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[13:36:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:36:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:37:36] that alert is for codfw, so I'm ignoring it
[13:37:52] taavi: thanks.
and thanks Lucas_WMDE for volunteering
[13:37:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:38:11] !log UTC afternoon deploys done
[13:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:59] \o/
[13:39:14] SRE: uwsgi socket/UDP logger is broken if no other logger uses the same format - https://phabricator.wikimedia.org/T312550 (klausman) Open→Resolved Change 817210 actually fixes this, we now see messages in logstash again. Apparently, an unset buffer size causes JSON generation to break. The upstream...
[13:39:24] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[13:41:15] (PS2) Ssingh: dnsdist: add support for IP blocklist [puppet] - https://gerrit.wikimedia.org/r/817270
[13:42:58] I’ll test out a possible config change on mwdebug1002
[13:43:08] (test out? try out. english!)
[13:43:12] (PS3) Ssingh: dnsdist: add support for IP blocklist [puppet] - https://gerrit.wikimedia.org/r/817270
[13:45:21] (PS4) Ssingh: dnsdist: add support for IP blocklist [puppet] - https://gerrit.wikimedia.org/r/817270
[13:45:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P31947 and previous config saved to /var/cache/conftool/dbconfig/20220726-134529-marostegui.json
[13:45:44] (PS1) FNegri: Add FNegri to Icinga authorized users [puppet] - https://gerrit.wikimedia.org/r/817276 (https://phabricator.wikimedia.org/T312597)
[13:49:56] (PS5) Ssingh: dnsdist: add support for IP blocklist [puppet] - https://gerrit.wikimedia.org/r/817270
[13:50:33] (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36411/console" [puppet] - https://gerrit.wikimedia.org/r/817270 (owner: Ssingh)
[13:53:45] (CR) Filippo Giunchedi: [C: +1] "LGTM! Feel free to merge!" [puppet] - https://gerrit.wikimedia.org/r/817276 (https://phabricator.wikimedia.org/T312597) (owner: FNegri)
[14:00:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P31949 and previous config saved to /var/cache/conftool/dbconfig/20220726-140034-marostegui.json
[14:00:41] (CR) AikoChou: [C: +1] ml-services: Add eu, hu & hy wiki articletopic isvcs to prod [deployment-charts] - https://gerrit.wikimedia.org/r/817269 (https://phabricator.wikimedia.org/T313307) (owner: Kevin Bazira)
[14:01:41] ok, I’m done testing on mwdebug1002 for now
[14:15:02] (CR) Jbond: [V: +1 C: +2] P:puppet::agent: Add parameter to install puppet-agent 7 [puppet] - https://gerrit.wikimedia.org/r/816205 (owner: Jbond)
[14:15:14] (CR) Jelto: "The PTR records seem to be in the wrong file. I left inline comments with more details."
[dns] - https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142) (owner: Dzahn)
[14:15:18] (PS2) Phuedx: testwiki: Add mediawiki.web_ui.interactions stream [mediawiki-config] - https://gerrit.wikimedia.org/r/817225 (https://phabricator.wikimedia.org/T311268)
[14:15:22] (CR) Ssingh: [C: +1] Icinga: Remove traffic alerts [puppet] - https://gerrit.wikimedia.org/r/814894 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall)
[14:15:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T312990)', diff saved to https://phabricator.wikimedia.org/P31950 and previous config saved to /var/cache/conftool/dbconfig/20220726-141540-marostegui.json
[14:15:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance
[14:15:45] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[14:15:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance
[14:15:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 9 hosts with reason: Maintenance
[14:16:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 9 hosts with reason: Maintenance
[14:17:30] (CR) Brennen Bearnes: [C: +1] aptrepo: update gitlab-ce & gitlab-runner to 15.0 [puppet] - https://gerrit.wikimedia.org/r/817211 (https://phabricator.wikimedia.org/T309062) (owner: Jelto)
[14:18:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[14:18:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[14:18:47] (CR) Cwhite: [C: +1] prometheus: update blackbox check alerts
runbook link (1 comment) [puppet] - https://gerrit.wikimedia.org/r/816135 (https://phabricator.wikimedia.org/T312947) (owner: Filippo Giunchedi)
[14:19:13] (CR) Jbond: [C: +1] admin: add mfossati to restricted group [puppet] - https://gerrit.wikimedia.org/r/817191 (https://phabricator.wikimedia.org/T313706) (owner: Volans)
[14:20:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[14:20:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[14:26:10] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:27:03] (PS1) MSantos: mobileapps: bump to 2022-07-26-132542-productio [deployment-charts] - https://gerrit.wikimedia.org/r/817279
[14:41:06] SRE, ops-codfw, DBA: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (Marostegui)
[14:44:23] SRE, ops-codfw, DBA: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (ssingh)
[14:45:31] SRE, Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (Eevans)
[14:45:52] SRE, Cassandra: Cassandra instance DNS records - are they needed? - https://phabricator.wikimedia.org/T269328 (Eevans)
[14:47:13] SRE, Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (Eevans)
[14:47:25] SRE, Cassandra: Cassandra instance DNS records - are they needed?
- https://phabricator.wikimedia.org/T269328 (Eevans)
[14:48:15] SRE, Data Engineering Planning, Data-Engineering-Kanban, Event-Platform, and 2 others: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (JArguello-WMF) Hi @akosiaris, Andrew is Out of office and will be back on Fri, Jul 29. :)
[14:48:41] (CR) Jbond: "See comments i think it would be usefull to have a chat about this" [puppet] - https://gerrit.wikimedia.org/r/817270 (owner: Ssingh)
[14:49:02] (CR) Vgutierrez: [C: +1] "- PCC looking good: https://puppet-compiler.wmflabs.org/pcc-worker1001/36412/" [puppet] - https://gerrit.wikimedia.org/r/810313 (https://phabricator.wikimedia.org/T311386) (owner: Giuseppe Lavagetto)
[14:50:57] SRE, Data Engineering Planning, Data-Engineering-Kanban, Event-Platform, and 2 others: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (akosiaris) >>! In T303543#8104853, @JArguello-WMF wrote: > Hi @akosiaris, Andrew is Out of office and will be back on Fri...
[14:51:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2143 with weight 0 T313811', diff saved to https://phabricator.wikimedia.org/P31951 and previous config saved to /var/cache/conftool/dbconfig/20220726-145116-root.json
[14:51:22] T313811: Switchover x2 master db2142 -> db2143 - https://phabricator.wikimedia.org/T313811
[14:52:24] SRE, ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (BCornwall)
[14:52:36] !log upload trafficserver_9.1.2-1wm1_amd64 to apt.wm.o (buster) - T309651
[14:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:40] T309651: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651
[14:53:18] ^^ component/trafficserver9 :)
[14:53:26] yeah that's in the command
[14:53:27] SRE, Traffic, Patch-For-Review: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651 (ssingh)
[14:53:45] add buster-wikimedia deb component/trafficserver9 amd64 trafficserver 9.1.2-1wm1 -- pool/component/trafficserver9/t/trafficserver/trafficserver_9.1.2-1wm1_amd64.deb
[14:54:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2144 with weight 0 and db2143 back with 100 T313811', diff saved to https://phabricator.wikimedia.org/P31952 and previous config saved to /var/cache/conftool/dbconfig/20220726-145412-root.json
[14:55:05] SRE, Security-Team, SecTeam-Processed, Security: Host crossdomain.xml master policy file - https://phabricator.wikimedia.org/T75574 (sbassett)
[14:58:37] (PS1) Lucas Werkmeister (WMDE): Revert "Add WikibaseTerms temporary debug log channel" [mediawiki-config] - https://gerrit.wikimedia.org/r/817285 (https://phabricator.wikimedia.org/T313039)
[14:59:39] SRE, ops-codfw, DBA: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (Marostegui)
[15:00:40] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s)
must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:02:56] SRE, ops-codfw, DBA: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (Marostegui)
[15:03:30] SRE, ops-codfw, DBA: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (Marostegui) All mysql hosts need mysql to be stopped before the maintenance.
[15:03:48] SRE, ops-codfw, DBA: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (Marostegui) All mysql hosts need mysql to be stopped before the maintenance.
[15:04:20] SRE, ops-codfw, DBA: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (Marostegui) All mysql hosts need mysql to be stopped before the maintenance.
[15:04:42] (CR) Andrea Denisse: [C: +2] netmon: Add suppport for multiple backup/passive nodes in Puppet [puppet] - https://gerrit.wikimedia.org/r/814848 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse)
[15:06:10] (CR) Andrea Denisse: [C: +2] netmon: Add suppport for multiple backup/passive nodes in Puppet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/814848 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse)
[15:10:08] SRE, ops-codfw, DBA: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (jcrespo)
[15:12:08] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:16:26] (CR) Giuseppe Lavagetto: [C: +2] lvs: check php 7.4 too on all appservers [puppet] - https://gerrit.wikimedia.org/r/810313 (https://phabricator.wikimedia.org/T311386) (owner: Giuseppe Lavagetto)
[15:16:27] SRE, ops-codfw: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (Papaul)
[15:17:24] SRE, ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (Papaul)
[15:18:03] <_joe_> !log restarting pybal on lvs2010 to check php 7.4 too
[15:18:04] SRE, ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (Papaul)
[15:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:30] SRE, ops-codfw: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (Papaul)
[15:20:45] (CR) Michael Große: [C: +1] Revert "Add WikibaseTerms temporary debug log channel" [mediawiki-config] - https://gerrit.wikimedia.org/r/817285 (https://phabricator.wikimedia.org/T313039) (owner: Lucas Werkmeister (WMDE))
[15:23:39] <_joe_> !log
restarting pybal on lvs2009 to check php 7.4 too
[15:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:28] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:25:39] <_joe_> !log restarting pybal on lvs1019 to check php 7.4 too
[15:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:51] <_joe_> XioNoX: every time I see that error I cringe
[15:25:59] <_joe_> it happens regularly when restarting pybal
[15:26:37] <_joe_> when restarting the primary pybal server I mean
[15:27:00] jouncebot: now
[15:27:00] No deployments scheduled for the next 0 hour(s) and 32 minute(s)
[15:27:21] I’ll deploy a no-op log channel cleanup if that’s okay with everyone https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/817285
[15:28:20] (CR) Lucas Werkmeister (WMDE): [C: +2] Revert "Add WikibaseTerms temporary debug log channel" [mediawiki-config] - https://gerrit.wikimedia.org/r/817285 (https://phabricator.wikimedia.org/T313039) (owner: Lucas Werkmeister (WMDE))
[15:28:50] _joe_: you can make it a cookbook that downtime it before the restart :)
[15:29:19] <3
[15:30:41] <_joe_> !log restarting pybal on lvs1020 to check php 7.4 too
[15:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:39] (Merged) jenkins-bot: Revert "Add WikibaseTerms temporary debug log channel" [mediawiki-config] - https://gerrit.wikimedia.org/r/817285 (https://phabricator.wikimedia.org/T313039) (owner: Lucas Werkmeister (WMDE))
[15:33:15] syncing ^
[15:33:31] (PS1) Jcrespo: Initial commit [software/pampinus] - https://gerrit.wikimedia.org/r/817294 (https://phabricator.wikimedia.org/T283017)
[15:34:40] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0]
https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[15:35:22] SRE, ops-codfw, decommission-hardware: decommission db2084 - https://phabricator.wikimedia.org/T313121 (Papaul)
[15:35:59] SRE, ops-codfw, decommission-hardware: decommission db2078 - https://phabricator.wikimedia.org/T312754 (Papaul)
[15:36:09] SRE, ops-codfw, decommission-hardware: decommission db2084 - https://phabricator.wikimedia.org/T313121 (Papaul) Open→Resolved Complete
[15:36:34] SRE, ops-codfw, decommission-hardware: decommission db2082 - https://phabricator.wikimedia.org/T313003 (Papaul)
[15:36:36] SRE, ops-codfw, decommission-hardware: decommission db2078 - https://phabricator.wikimedia.org/T312754 (Papaul) Open→Resolved Complete
[15:36:37] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:817285|Revert "Add WikibaseTerms temporary debug log channel" (T313039)]] (grep confirms wmf.21+ code has no mentions of this channel) (duration: 03m 19s)
[15:36:42] T313039: Remove debug logging for item terms storage after merging - https://phabricator.wikimedia.org/T313039
[15:36:57] ok, I’m done :)
[15:37:02] SRE, ops-codfw, decommission-hardware: decommission db2082 - https://phabricator.wikimedia.org/T313003 (Papaul) Open→Resolved Complete
[15:37:12] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[15:38:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[15:40:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE
helmfile.d/services/mwdebug: apply
[15:40:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[15:40:11] SRE, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for EllenR - https://phabricator.wikimedia.org/T313821 (ERayfield)
[15:40:51] SRE, MediaWiki-General, Traffic-Icebox, Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (ori) >>! In T138093#8092400, @ori wrote: > - On that subject: we need to validate that query-sorting is safe for CXServer (or else exc...
[15:40:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[15:42:43] (PS12) Dduvall: gitlab_runner: Handle changes to runner config [puppet] - https://gerrit.wikimedia.org/r/815769 (https://phabricator.wikimedia.org/T311746)
[15:44:48] (CR) Dduvall: [C: +1] "Thanks for the review, Jelto. All fixed! I think the round of testing I did should be sufficient. Just note that runners will have to be r" [puppet] - https://gerrit.wikimedia.org/r/815769 (https://phabricator.wikimedia.org/T311746) (owner: Dduvall)
[15:46:32] ops-codfw: codfw: ml-serve2001 memmory issue DIMM A2 - https://phabricator.wikimedia.org/T313822 (Papaul)
[15:48:08] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[15:49:25] (CR) Thcipriani: [C: +1] Add dancy to phabricator-roots [puppet] - https://gerrit.wikimedia.org/r/816035 (https://phabricator.wikimedia.org/T313551) (owner: Ahmon Dancy)
[15:49:57] (CR) Giuseppe Lavagetto: [C: +2] shellbox: Restore replicas to 8, now that T312319 is resolved.
[deployment-charts] - https://gerrit.wikimedia.org/r/816873 (https://phabricator.wikimedia.org/T310557) (owner: RLazarus)
[15:51:03] SRE, SRE-Access-Requests, Patch-For-Review, Release-Engineering-Team (The Decommission Mission 💀): Add dancy to phabricator-roots - https://phabricator.wikimedia.org/T313551 (thcipriani)
[15:51:53] SRE, SRE-Access-Requests, Patch-For-Review, Release-Engineering-Team (The Decommission Mission 💀): Add dancy to phabricator-roots - https://phabricator.wikimedia.org/T313551 (thcipriani) >>! In T313551#8095985, @dancy wrote: > Noting that my manager @thcipriani is on vacation right now. Approv...
[15:52:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:52:19] (PS1) Jbond: P:cache::varnish::frontend: Drop confd_experiment_fqdn [puppet] - https://gerrit.wikimedia.org/r/817298 (https://phabricator.wikimedia.org/T288106)
[15:52:20] (PS1) Jbond: P:cache::varnish::frontend: remove parse_abuse nets [puppet] - https://gerrit.wikimedia.org/r/817299
[15:53:48] SRE, Traffic-Icebox: acme-chief should be able to refresh OCSP stapling response even if the renewal process fails - https://phabricator.wikimedia.org/T244232 (BCornwall) a:BCornwall
[15:55:41] (Merged) jenkins-bot: shellbox: Restore replicas to 8, now that T312319 is resolved.
[deployment-charts] - https://gerrit.wikimedia.org/r/816873 (https://phabricator.wikimedia.org/T310557) (owner: RLazarus)
[15:56:06] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox: apply
[15:56:09] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox: apply
[15:56:25] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox: apply
[15:56:36] (CR) Thcipriani: [C: +1] admin: add mfossati to restricted group [puppet] - https://gerrit.wikimedia.org/r/817191 (https://phabricator.wikimedia.org/T313706) (owner: Volans)
[15:56:54] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply
[15:58:10] (CR) Volans: [C: +2] admin: add mfossati to restricted group [puppet] - https://gerrit.wikimedia.org/r/817191 (https://phabricator.wikimedia.org/T313706) (owner: Volans)
[15:58:21] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox: apply
[15:58:32] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply
[15:58:42] denisse|m: I got a commit for you too to puppet-merge, is that fine to merge?
[15:58:55] netmon: Add suppport for multiple backup/passive nodes in Puppet (a4ac6acce8)
[16:00:05] jbond and rzl: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220726T1600).
[16:00:05] No Gerrit patches in the queue for this window AFAICS.
[16:00:28] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to maintenance servers for mfossati - https://phabricator.wikimedia.org/T313706 (thcipriani) >>! In T313706#8101237, @Volans wrote: > I think that you want the `restricted` group: > ` > description: access to mwmaint hosts, mwlog hosts (pr...
[16:00:47] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to maintenance servers for mfossati - https://phabricator.wikimedia.org/T313706 (10thcipriani) [16:02:20] cwhite, godog maybe you can answer to the above question (puppet-merge of a patch) as you reviewed it [16:03:34] volans: yes, thank you very much. :) [16:03:38] (03CR) 10Elukey: [C: 03+2] prometheus: add config for the k8s ml-staging codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/817201 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey) [16:03:48] denisse|m: ack, doing [16:04:04] denisse|m: {done} [16:04:37] (03CR) 10Andrew Bogott: [C: 03+2] Add fnegri to contactgroups.cfg [puppet] - 10https://gerrit.wikimedia.org/r/816837 (https://phabricator.wikimedia.org/T312597) (owner: 10FNegri) [16:07:37] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to maintenance servers for mfossati - https://phabricator.wikimedia.org/T313706 (10Volans) Patch has been merged, it will be reflected in the fleet within ~30 minutes. @mfossati after 16:35 UTC you can verify you've access and if it's all go... 
[16:09:38] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:12:19] (03CR) 10FNegri: [C: 03+2] Add FNegri to Icinga authorized users [puppet] - 10https://gerrit.wikimedia.org/r/817276 (https://phabricator.wikimedia.org/T312597) (owner: 10FNegri) [16:13:45] 10SRE, 10Domains: domain name Wikkipedia.be - https://phabricator.wikimedia.org/T313823 (10Walter) [16:14:15] (03PS2) 10Volans: Add dancy to phabricator-roots [puppet] - 10https://gerrit.wikimedia.org/r/816035 (https://phabricator.wikimedia.org/T313551) (owner: 10Ahmon Dancy) [16:15:56] (03CR) 10Volans: [C: 03+2] "approved on task" [puppet] - 10https://gerrit.wikimedia.org/r/816035 (https://phabricator.wikimedia.org/T313551) (owner: 10Ahmon Dancy) [16:17:33] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (The Decommission Mission 💀): Add dancy to phabricator-roots - https://phabricator.wikimedia.org/T313551 (10Volans) 05Open→03Resolved a:03Volans The patch has been merged, it will be reflected in the fleet within the last 30... 
[16:20:03] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:23:49] 10SRE, 10conftool: Add requestctl support to ferm - https://phabricator.wikimedia.org/T313825 (10jbond) [16:28:04] (03PS1) 10Jbond: P:base::firewall: Add requestctl definitions to ferm [puppet] - 10https://gerrit.wikimedia.org/r/817307 (https://phabricator.wikimedia.org/T313825) [16:32:40] !log brennen@deploy1002 Started deploy [phabricator/deployment@8a7d4bf]: no-op demonstration deploy to phab2001 [16:32:58] (03PS2) 10Jbond: P:base::firewall: Add requestctl definitions to ferm [puppet] - 10https://gerrit.wikimedia.org/r/817307 (https://phabricator.wikimedia.org/T313825) [16:33:06] !log brennen@deploy1002 Finished deploy [phabricator/deployment@8a7d4bf]: no-op demonstration deploy to phab2001 (duration: 00m 26s) [16:33:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36414/console" [puppet] - 10https://gerrit.wikimedia.org/r/817307 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [16:47:02] (03CR) 10Jbond: [V: 03+1] P:base::firewall: Add requestctl definitions to ferm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/817307 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [16:47:12] (03PS3) 10Jbond: P:base::firewall: Add requestctl definitions to ferm [puppet] - 10https://gerrit.wikimedia.org/r/817307 (https://phabricator.wikimedia.org/T313825) [16:50:07] (03PS1) 10TrainBranchBot: testwikis wikis to 1.39.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817308 (https://phabricator.wikimedia.org/T308075) [16:50:09] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.39.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817308 (https://phabricator.wikimedia.org/T308075) (owner: 10TrainBranchBot) [16:50:14] 10SRE, 10conftool, 10Patch-For-Review: Add 
requestctl support to ferm - https://phabricator.wikimedia.org/T313825 (10jbond) p:05Triage→03Medium [16:51:02] (03CR) 10Dzahn: [C: 03+2] "comments only :) thanks" [puppet] - 10https://gerrit.wikimedia.org/r/807176 (owner: 10Dzahn) [16:51:32] (03CR) 10Jbond: dnsdist: add support for IP blocklist (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/817270 (owner: 10Ssingh) [16:51:34] (03Merged) 10jenkins-bot: testwikis wikis to 1.39.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817308 (https://phabricator.wikimedia.org/T308075) (owner: 10TrainBranchBot) [16:51:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:52:33] !log brennen@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.22 refs T308075 [16:52:37] T308075: 1.39.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T308075 [16:53:05] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:54:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:54:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:57:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:59:56] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10RobH) [17:00:14] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10RobH) [17:00:30] 10SRE, 10SRE-Access-Requests: Requesting access to phab1001/phab2001 for Daniel Duvall (dduvall) - https://phabricator.wikimedia.org/T313831 (10dduvall) 
[17:02:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:03:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:03:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:04:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:05:27] (03CR) 10MSantos: [C: 03+2] mobileapps: bump to 2022-07-26-132542-productio [deployment-charts] - 10https://gerrit.wikimedia.org/r/817279 (owner: 10MSantos) [17:07:39] 10SRE, 10serviceops: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (10RobH) [17:08:49] (03Merged) 10jenkins-bot: mobileapps: bump to 2022-07-26-132542-productio [deployment-charts] - 10https://gerrit.wikimedia.org/r/817279 (owner: 10MSantos) [17:09:16] !log mbsantos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [17:09:48] !log mbsantos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [17:09:59] !log mbsantos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [17:10:49] !log mbsantos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [17:11:20] !log mbsantos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [17:12:07] !log mbsantos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [17:15:53] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:16:24] (03CR) 10Ssingh: [V: 03+1] dnsdist: add support for IP blocklist (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/817270 (owner: 10Ssingh) [17:20:50] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2085 - https://phabricator.wikimedia.org/T313239 (10Papaul) [17:21:56] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2085 - 
https://phabricator.wikimedia.org/T313239 (10Papaul) 05Open→03Resolved complete [17:22:54] 10SRE, 10Traffic: Test ESI feasibility with current Varnish installation - https://phabricator.wikimedia.org/T308799 (10AndyRussG) Wooohoooo, many congrats on this merge @BBlack and @Vgutierrez!!!!! Also many thanks for the explanation and many apologies for the long delay in replying @Vgutierrez!! Just addin... [17:28:23] !log brennen@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.22 refs T308075 (duration: 35m 50s) [17:28:28] T308075: 1.39.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T308075 [17:29:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:36:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:36:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:40:11] !log bking@cumin1001 conftool action : set/pooled=inactive; selector: name=elastic2049 [17:42:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:43:47] 10SRE, 10Infrastructure-Foundations, 10netops: Lumen link between cr2-eqiad and cr2-esams down - July 2022 - https://phabricator.wikimedia.org/T313783 (10ayounsi) Circuit back up as of 2022-07-26 12:18:32 UTC (05:19:01 ago). Lumen got back to me saying it's working as expected for them. I asked for an RFO a... 
[17:47:54] (03PS1) 10DCausse: [WIP] Tune wikidata language selector autocomplete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817317 (https://phabricator.wikimedia.org/T307869) [17:50:11] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:51:30] (03CR) 10DCausse: [WIP] Tune wikidata language selector autocomplete (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817317 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse) [17:57:34] 10SRE-OnFire, 10observability, 10SRE Observability (FY2022/2023-Q1): Business hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers - https://phabricator.wikimedia.org/T313603 (10lmata) Feedback from Splunk support (anonymized): > the primary reason for the behavior f... [17:58:37] PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:00:04] brennen and jeena: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220726T1800). 
[18:00:55] o/ [18:01:25] o/ [18:01:35] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:03:19] (03PS1) 10TrainBranchBot: group0 wikis to 1.39.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817320 (https://phabricator.wikimedia.org/T308075) [18:03:21] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.39.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817320 (https://phabricator.wikimedia.org/T308075) (owner: 10TrainBranchBot) [18:04:26] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817320 (https://phabricator.wikimedia.org/T308075) (owner: 10TrainBranchBot) [18:04:28] !log [doc1002:~] $ sudo systemctl start rsync-doc-doc2001.codfw.wmnet.service [18:04:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:44] (03PS13) 10Ebernhardson: elastic: Restart masters one at a time after all others [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) [18:05:55] RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:07:01] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:08:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:08:28] (03CR) 10BCornwall: [C: 03+1] "Thanks for the details on why this change exists! 
If I understand this correctly this is narrowing down the scope of abuse_nets to only sy" [puppet] - 10https://gerrit.wikimedia.org/r/817299 (owner: 10Jbond) [18:08:46] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.22 refs T308075 [18:08:52] T308075: 1.39.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T308075 [18:09:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:09:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:10:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:19:59] 10SRE, 10Infrastructure-Foundations, 10netops: Lumen link between cr2-eqiad and cr2-esams down - July 2022 - https://phabricator.wikimedia.org/T313783 (10Volans) p:05Triage→03Medium [18:20:33] (03CR) 10BCornwall: [C: 03+1] "Quoth @ema:" [puppet] - 10https://gerrit.wikimedia.org/r/817298 (https://phabricator.wikimedia.org/T288106) (owner: 10Jbond) [18:23:06] 10SRE, 10ops-codfw, 10Machine-Learning-Team: codfw: ml-serve2001 memmory issue DIMM A2 - https://phabricator.wikimedia.org/T313822 (10Volans) p:05Triage→03High [18:24:13] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission frdb1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T313607 (10RobH) [18:30:48] 10SRE-OnFire, 10observability, 10SRE Observability (FY2022/2023-Q1): Business hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers - https://phabricator.wikimedia.org/T313603 (10CDanis) So I think we have three options here: 1. My originally-proposed routing key hack... 
[18:40:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:41:53] !log ryankemper@cumin1001 START - Cookbook sre.hosts.decommission for hosts elastic2049.codfw.wmnet [18:44:15] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts elastic2049.codfw.wmnet [18:44:46] (03PS1) 10Ryan Kemper: elastic: decom elastic2049 [puppet] - 10https://gerrit.wikimedia.org/r/817346 (https://phabricator.wikimedia.org/T311939) [18:45:24] (03PS5) 10Southparkfan: rsyslog: allow specifying TLS client auth settings and filename property [puppet] - 10https://gerrit.wikimedia.org/r/816046 (https://phabricator.wikimedia.org/T127717) [18:46:06] (03PS2) 10Ryan Kemper: elastic: decom elastic2049 [puppet] - 10https://gerrit.wikimedia.org/r/817346 (https://phabricator.wikimedia.org/T311939) [18:47:03] (03PS3) 10Ryan Kemper: elastic: decom elastic2049 [puppet] - 10https://gerrit.wikimedia.org/r/817346 (https://phabricator.wikimedia.org/T313842) [18:47:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:47:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:47:19] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:50:41] (03PS4) 10Ryan Kemper: elastic: decom elastic2049 [puppet] - 10https://gerrit.wikimedia.org/r/817346 (https://phabricator.wikimedia.org/T313842) [18:51:39] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/817346 (https://phabricator.wikimedia.org/T313842) (owner: 10Ryan Kemper) [18:52:01] (03CR) 10Ryan Kemper: [C: 03+2] elastic: decom elastic2049 [puppet] - 
10https://gerrit.wikimedia.org/r/817346 (https://phabricator.wikimedia.org/T313842) (owner: 10Ryan Kemper) [18:53:02] !log ryankemper@cumin1001 START - Cookbook sre.hosts.decommission for hosts elastic2049.codfw.wmnet [18:53:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:53:49] 10SRE-OnFire, 10observability, 10SRE Observability (FY2022/2023-Q1): Business hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers - https://phabricator.wikimedia.org/T313603 (10CDanis) wrt option #2: having just tried it via `curl`, it looks like we will have to dele... [18:54:11] 10SRE, 10ops-codfw, 10Elasticsearch, 10Discovery-Search (Current work), 10Patch-For-Review: Degraded RAID on elastic2049 - https://phabricator.wikimedia.org/T311939 (10Gehel) Decom is tracked in T313842 [18:59:04] !log ryankemper@cumin1001 START - Cookbook sre.dns.netbox [19:01:39] (03PS1) 10RLazarus: requestctl: Add a reminder to "requestctl commit" after enable/disable [software/conftool] - 10https://gerrit.wikimedia.org/r/817351 (https://phabricator.wikimedia.org/T305580) [19:03:24] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:03:24] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts elastic2049.codfw.wmnet [19:06:14] 10ops-codfw, 10DC-Ops, 10decommission-hardware, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2049.codfw.wmnet - https://phabricator.wikimedia.org/T313842 (10Gehel) [19:06:27] 10ops-codfw, 10DC-Ops, 10decommission-hardware, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2049.codfw.wmnet - https://phabricator.wikimedia.org/T313842 (10Gehel) a:05bking→03Papaul [19:08:23] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:10:09] RECOVERY - MegaRAID on an-worker1082 is 
OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:16:43] (03CR) 10Nskaggs: "Is the intent to convert mon.yaml,load_all.yaml etc as well? I'm assuming the intent is to convert to double quotes? https://yaml.org/spec" [puppet] - 10https://gerrit.wikimedia.org/r/816005 (owner: 10David Caro) [19:19:05] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:27:19] (03CR) 10Ssingh: [V: 03+1] dnsdist: add support for IP blocklist (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/817270 (owner: 10Ssingh) [19:33:12] (03CR) 10Dzahn: dnsdist: add support for IP blocklist (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/817270 (owner: 10Ssingh) [19:34:04] (03CR) 10Ssingh: [V: 03+1] dnsdist: add support for IP blocklist (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/817270 (owner: 10Ssingh) [19:36:25] (03CR) 10Dzahn: dnsdist: add support for IP blocklist (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/817270 (owner: 10Ssingh) [19:39:54] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install logstash203[67] - https://phabricator.wikimedia.org/T313848 (10RobH) [19:40:33] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install logstash203[67] - https://phabricator.wikimedia.org/T313848 (10RobH) [19:40:48] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install logstash203[67] - https://phabricator.wikimedia.org/T313848 (10RobH) [19:41:09] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-cinder-backup: fix Retrying() call [puppet] - 10https://gerrit.wikimedia.org/r/816841 (owner: 10Andrew Bogott) [19:42:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (10RobH) [19:43:00] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-cinder-backup: fix Retrying() call (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/816841 (owner: 10Andrew Bogott) [19:43:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (10RobH) [19:43:13] (03PS3) 10Andrew Bogott: wmcs-cinder-backup: fix Retrying() call [puppet] - 10https://gerrit.wikimedia.org/r/816841 [19:43:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (10RobH) [19:44:27] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:46:37] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-cinder-backup: fix Retrying() call [puppet] - 10https://gerrit.wikimedia.org/r/816841 (owner: 10Andrew Bogott) [19:49:42] (03PS1) 10Ssingh: Update Wikidough profile data to read from common.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/817369 [19:50:35] (03CR) 10Ssingh: [V: 03+2 C: 03+2] Update Wikidough profile data to read from common.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/817369 (owner: 10Ssingh) [19:50:49] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:58:11] (03PS6) 10Ssingh: dnsdist: add support for IP blocklist [puppet] - 10https://gerrit.wikimedia.org/r/817270 [19:58:56] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36415/console" [puppet] - 10https://gerrit.wikimedia.org/r/817270 (owner: 10Ssingh) [19:59:31] (03PS7) 10Ssingh: dnsdist: add support for IP blocklist [puppet] - 10https://gerrit.wikimedia.org/r/817270 
[20:00:04] RoanKattouw, Urbanecm, and cjming: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220726T2000). [20:00:05] koi: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:18] o/ [20:00:21] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36416/console" [puppet] - 10https://gerrit.wikimedia.org/r/817270 (owner: 10Ssingh) [20:03:38] hi koi: i can deploy your patch [20:03:44] 1 sec [20:03:52] (03PS1) 10Stang: ptwiki: Restrict "move" permission [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817373 (https://phabricator.wikimedia.org/T313802) [20:05:36] (03CR) 10Clare Ming: [C: 03+2] etwikiquote: Change logo for 10k articles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816705 (https://phabricator.wikimedia.org/T313698) (owner: 10Stang) [20:07:08] (03Merged) 10jenkins-bot: etwikiquote: Change logo for 10k articles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816705 (https://phabricator.wikimedia.org/T313698) (owner: 10Stang) [20:07:41] koi: on mwdebug1002 - can you test? 
[20:07:47] looking [20:08:39] cjming: LGTM [20:08:40] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install graphite2004 - https://phabricator.wikimedia.org/T313851 (10RobH) [20:08:48] koi: great - will sync then [20:08:56] (03PS8) 10Ssingh: dnsdist: add support for IP blocklist [puppet] - 10https://gerrit.wikimedia.org/r/817270 [20:09:09] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install graphite2004 - https://phabricator.wikimedia.org/T313851 (10RobH) [20:09:17] RECOVERY - Check for snapshots leaked by cinder backup agent on cloudcontrol1004 is OK: 1 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [20:09:29] RECOVERY - Check for snapshots leaked by cinder backup agent on cloudcontrol1007 is OK: 1 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [20:09:37] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36417/console" [puppet] - 10https://gerrit.wikimedia.org/r/817270 (owner: 10Ssingh) [20:10:14] ACKNOWLEDGEMENT - Query Service HTTP Port on wdqs1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.002 second response time Brian_King Investigating now https://wikitech.wikimedia.org/wiki/Wikidata_query_service [20:10:14] ACKNOWLEDGEMENT - WDQS SPARQL on wdqs1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 332 bytes in 1.048 second response time Brian_King Investigating now https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbo [20:10:17] RECOVERY - Check for snapshots leaked by cinder backup agent on cloudcontrol1005 is OK: 1 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [20:10:29] RECOVERY - Check for snapshots leaked by cinder backup agent on cloudcontrol1006 is OK: 1 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [20:10:33] RECOVERY - Check for snapshots leaked by cinder backup agent on cloudcontrol1003 is OK: 1 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [20:12:48] !log cjming@deploy1002 Synchronized static/images/project-logos/: Config: [[gerrit:816705|etwikiquote: Change logo for 10k articles (T313698)]] (duration: 03m 28s) [20:12:55] T313698: Requesting temporary logo change for et.wikiquote.org - https://phabricator.wikimedia.org/T313698 [20:14:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10observability: Q1:rack/setup/install graphite1005 - https://phabricator.wikimedia.org/T313853 (10RobH) [20:14:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:15:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10observability: Q1:rack/setup/install graphite1005 - https://phabricator.wikimedia.org/T313853 (10RobH) [20:16:11] !log cjming@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:816705|etwikiquote: Change logo for 10k articles (T313698)]] (duration: 03m 15s) [20:19:34] !log cjming@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:816705|etwikiquote: 
Change logo for 10k articles (T313698)]] (duration: 03m 07s) [20:19:38] T313698: Requesting temporary logo change for et.wikiquote.org - https://phabricator.wikimedia.org/T313698 [20:19:45] koi: your patch should be live - lmk if not [20:20:08] looks nice now, thanks [20:20:12] np! [20:20:58] !log end of UTC late backport window [20:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:21:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:21:34] !log depool wdqs1004 [20:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:08] (03PS1) 10Nskaggs: Expand retry logic for cinder backups [puppet] - 10https://gerrit.wikimedia.org/r/817378 (https://phabricator.wikimedia.org/T310640) [20:25:43] (03CR) 10CI reject: [V: 04-1] Expand retry logic for cinder backups [puppet] - 10https://gerrit.wikimedia.org/r/817378 (https://phabricator.wikimedia.org/T310640) (owner: 10Nskaggs) [20:27:16] !log bking@wdqs1004 restarted blazegraph services that were (are?) 
alerting for 503 [20:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:31] RECOVERY - WDQS SPARQL on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.080 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:27:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:27:57] RECOVERY - Query Service HTTP Port on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [20:34:40] (03PS2) 10Nskaggs: Expand retry logic for cinder backups [puppet] - 10https://gerrit.wikimedia.org/r/817378 (https://phabricator.wikimedia.org/T310640) [20:38:03] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti203[12] - https://phabricator.wikimedia.org/T313856 (10RobH) [20:39:25] 10SRE, 10Infrastructure-Foundations: ganeti203[12] implementation tracking - https://phabricator.wikimedia.org/T313857 (10RobH) [20:39:46] 10SRE, 10Infrastructure-Foundations: ganeti203[12] implementation tracking - https://phabricator.wikimedia.org/T313857 (10RobH) [20:40:00] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti203[12] - https://phabricator.wikimedia.org/T313856 (10RobH) [20:52:11] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:55:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10observability: Q1:rack/setup/install centrallog1002 - https://phabricator.wikimedia.org/T313858 (10RobH) [20:55:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10observability: Q1:rack/setup/install centrallog1002 - https://phabricator.wikimedia.org/T313858 (10RobH) [21:11:02] 10SRE, 10SRE-Access-Requests: Requesting access to phab1001/phab2001 for Daniel Duvall (dduvall) - https://phabricator.wikimedia.org/T313831 (10Dzahn) The following groups exist: ` 
phabricator-admin: gid: 746 description: Users who can do sane CLI admin things * Remove repositories... [21:11:52] (03CR) 10Andrew Bogott: Expand retry logic for cinder backups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/817378 (https://phabricator.wikimedia.org/T310640) (owner: 10Nskaggs) [21:15:06] 10SRE, 10Observability-Logging, 10Patch-For-Review: Move logstash api-feature-usage output away from v5 cluster - https://phabricator.wikimedia.org/T297239 (10colewhite) [21:15:34] 10SRE, 10Elasticsearch, 10Observability-Logging, 10Wikimedia-Logstash, and 2 others: logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable - https://phabricator.wikimedia.org/T176335 (10colewhite) 05Open→03Resolved a:03colewhite This was resolved in T297239 - the main lo... [21:21:59] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:23:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10Andrew) [21:25:41] !log brennen@deploy1002 Started deploy [phabricator/deployment@8a7d4bf]: test deploy to phab2001 [21:25:46] !log brennen@deploy1002 Finished deploy [phabricator/deployment@8a7d4bf]: test deploy to phab2001 (duration: 00m 05s) [21:28:02] !log brennen@deploy1002 Started deploy [phabricator/deployment@8a7d4bf]: test deploy to phab2001 [21:28:21] !log brennen@deploy1002 Finished deploy [phabricator/deployment@8a7d4bf]: test deploy to phab2001 (duration: 00m 19s) [21:30:13] !log brennen@deploy1002 Started deploy [phabricator/deployment@8a7d4bf]: test deploy to phab2001 [21:30:25] !log brennen@deploy1002 Finished deploy [phabricator/deployment@8a7d4bf]: test deploy to phab2001 (duration: 00m 11s) [21:32:43] (apologies for deployspam.) 
[21:32:51] !log brennen@deploy1002 Started deploy [phabricator/deployment@8a7d4bf]: test deploy to phab2001
[21:33:43] !log brennen@deploy1002 Finished deploy [phabricator/deployment@8a7d4bf]: test deploy to phab2001 (duration: 00m 51s)
[21:46:38] SRE, Wikimedia-Logstash, observability, Patch-For-Review, SRE Observability (FY2021/2022-Q1): Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 (colewhite)
[21:46:41] SRE, Observability-Logging, Wikimedia-Logstash, observability: Ingest production logs with ELK7 - https://phabricator.wikimedia.org/T235891 (colewhite) Open→Invalid We no longer use ES.
[21:48:31] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:51:17] !log brennen@deploy1002 Started deploy [phabricator/deployment@8a7d4bf]: test deploy to phab2001
[21:51:22] !log brennen@deploy1002 Finished deploy [phabricator/deployment@8a7d4bf]: test deploy to phab2001 (duration: 00m 05s)
[21:53:01] !log brennen@deploy1002 Started deploy [phabricator/deployment@8a7d4bf]: test deploy to phab2001
[21:53:07] !log brennen@deploy1002 Finished deploy [phabricator/deployment@8a7d4bf]: test deploy to phab2001 (duration: 00m 05s)
[21:53:56] (PS1) Andrew Bogott: Move cloudweb100[12] to role::spare [puppet] - https://gerrit.wikimedia.org/r/817385 (https://phabricator.wikimedia.org/T313861)
[21:53:58] (PS1) Andrew Bogott: Remove puppet refs to labweb100[12] [puppet] - https://gerrit.wikimedia.org/r/817386 (https://phabricator.wikimedia.org/T313861)
[21:54:23] !log brennen@deploy1002 Started deploy [phabricator/deployment@8a7d4bf]: test deploy to phab2001
[21:54:29] !log brennen@deploy1002 Finished deploy [phabricator/deployment@8a7d4bf]: test deploy to phab2001 (duration: 00m 05s)
[21:55:01] (CR) Andrew Bogott: [C: +2] hieradata: close down cloudweb envoy port [puppet] - https://gerrit.wikimedia.org/r/816171 (https://phabricator.wikimedia.org/T305414) (owner: Majavah)
[22:02:56] !log brennen@deploy1002 Started deploy [phabricator/deployment@8a7d4bf]: test deploy to phab2001
[22:03:02] !log brennen@deploy1002 Finished deploy [phabricator/deployment@8a7d4bf]: test deploy to phab2001 (duration: 00m 05s)
[22:05:59] !log brennen@deploy1002 Started deploy [phabricator/deployment@0950b61]: test deploy to phab2001
[22:06:27] !log brennen@deploy1002 Finished deploy [phabricator/deployment@0950b61]: test deploy to phab2001 (duration: 00m 27s)
[22:17:52] (PS1) Ebernhardson: [WIP] Set CORS headers appropriate to WCQS [puppet] - https://gerrit.wikimedia.org/r/817387
[22:18:20] (PS1) Cwhite: logstash: add rolling strategy to json logs [puppet] - https://gerrit.wikimedia.org/r/817388 (https://phabricator.wikimedia.org/T166107)
[22:18:53] (CR) Ebernhardson: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36418/console" [puppet] - https://gerrit.wikimedia.org/r/817387 (owner: Ebernhardson)
[22:21:01] SRE, Observability-Logging, Wikimedia-Logstash, observability, and 2 others: Implement sensitive logstash access control - https://phabricator.wikimedia.org/T213902 (colewhite) a:colewhite We will try the OpenSearch security plugin.
[22:21:40] (PS2) Ebernhardson: [WIP] Set CORS headers appropriate to WCQS [puppet] - https://gerrit.wikimedia.org/r/817387
[22:22:38] (CR) Ebernhardson: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36419/console" [puppet] - https://gerrit.wikimedia.org/r/817387 (owner: Ebernhardson)
[22:25:24] SRE, ops-codfw, DC-Ops, observability: Q1:rack/setup/install netmon2002 - https://phabricator.wikimedia.org/T313867 (RobH)
[22:25:57] SRE, ops-codfw, DC-Ops, observability: Q1:rack/setup/install netmon2002 - https://phabricator.wikimedia.org/T313867 (RobH)
[22:26:27] SRE, Observability-Logging, User-fgiunchedi: Ingest webrequest sampled 1000 into logstash - https://phabricator.wikimedia.org/T301110 (colewhite)
[22:26:29] SRE, SRE-OnFire, Observability-Logging, Sustainability (Incident Followup), Wikimedia-Incident: create a sampled log of POST data - https://phabricator.wikimedia.org/T309186 (colewhite)
[22:26:32] SRE, Observability-Logging: Develop tooling for quickly parsing 5xx and sampled-1000 logs - https://phabricator.wikimedia.org/T292682 (colewhite)
[22:26:33] SRE, Observability-Logging, Wikimedia-Logstash, observability, User-herron: Ship MX logs to ELK - https://phabricator.wikimedia.org/T197173 (colewhite)
[22:26:35] SRE, Observability-Logging, Wikimedia-Logstash, observability, and 2 others: Ship host syslogs to ELK - https://phabricator.wikimedia.org/T193766 (colewhite)
[22:38:24] (PS3) Ebernhardson: [WIP] Set CORS headers appropriate to WCQS [puppet] - https://gerrit.wikimedia.org/r/817387 (https://phabricator.wikimedia.org/T307391)
[22:40:19] (CR) Ebernhardson: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36420/console" [puppet] - https://gerrit.wikimedia.org/r/817387 (https://phabricator.wikimedia.org/T307391) (owner: Ebernhardson)
[22:44:27] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:45:41] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:46:13] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:48:03] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.295 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:48:31] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48391 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:50:01] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[22:50:28] (PS4) Ebernhardson: Set CORS headers appropriate to WCQS [puppet] - https://gerrit.wikimedia.org/r/817387 (https://phabricator.wikimedia.org/T307391)
[22:54:09] (CR) Ebernhardson: "Tested by copying the compiled template out of pcc and placing on the prod servers (and then undoing). Can now access from test.wikipedia." [puppet] - https://gerrit.wikimedia.org/r/817387 (https://phabricator.wikimedia.org/T307391) (owner: Ebernhardson)
[22:55:16] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes202[01] - https://phabricator.wikimedia.org/T313870 (RobH)
[22:55:21] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes202[01] - https://phabricator.wikimedia.org/T313870 (RobH)
[22:56:10] (CR) Ebernhardson: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36421/console" [puppet] - https://gerrit.wikimedia.org/r/817387 (https://phabricator.wikimedia.org/T307391) (owner: Ebernhardson)
[22:57:26] SRE, ops-codfw, DC-Ops, serviceops: kubernetes202[01] implementation tracking - https://phabricator.wikimedia.org/T313871 (RobH)
[23:05:00] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes102[01] - https://phabricator.wikimedia.org/T313873 (RobH)
[23:06:46] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes102[01] - https://phabricator.wikimedia.org/T313873 (RobH)
[23:07:42] SRE, serviceops: kubernetes102[01] implemetation tracking - https://phabricator.wikimedia.org/T313874 (RobH)
[23:11:51] * Krinkle testing on mwdebug1001
[23:16:23] SRE, Epic, cloud-services-team (Kanban): CloudVPS: network architecture - https://phabricator.wikimedia.org/T209460 (RobH)
[23:32:05] SRE-Access-Requests: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (Raymond_Ndibe)
[23:32:50] SRE-Access-Requests: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (Raymond_Ndibe)
[23:59:03] !log removing one file for legal compliance
[23:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log